SUMMARY


This paper reviews generalizability theory and discusses its applicability to the Air Force Job Performance
Measurement Project. It is concluded that generalizability theory has relevance for the project. Application
of the theory is illustrated with analyses of performance data collected from Air Force jet engine mechanics.
Error variance in measurement is estimated for rating sources (incumbents, supervisors, and peers), rating
forms, and specific items included on each form. Possible measurement conditions eliciting desirable
degrees of generalizability are presented. The relationship of generalizability theory to construct validity and
the logical requirements for performance ratings are discussed.

PREFACE


The  Air  Force  Job  Performance  Measurement  Project  is  a  remarkably  broad, 
encompassing  attempt  to  assess  individual  job  proficiency.  Within  the  tested  specialties, 
incumbents are assessed via a Walk-Through Performance Test (with hands-on and interview
components),  and  subjective  performance  evaluations.  The  performance  ratings 
themselves  are  broad,  as  incumbents  are  evaluated  on  four  different  forms  (varying  in  task 
specificity)  by  themselves,  their  peers,  and  supervisors. 

Collecting  so  much  data  for  each  Air  Force  specialty  would  be  impractical  in  the  long 
run.  Thus,  one  goal  of  the  project  staff  is  to  reduce  the  total  number  of  measures  collected, 
by  empirically  evaluating  and  comparing  the  measures  as  data  are  collected  from  new 
specialties.  Another  goal  is  to  modify  and  improve  measures  which  will  ultimately  be  retained. 
Generalizability theory is ideally suited to both goals. Generalizability theory is useful for
investigating  whether  scores  on  any  measurement  instrument  are  dependable  over  varying 
conditions  of  measurement.  Both  rating  forms  and  rating  sources  can  be  considered 
measurement  conditions.  If  scores  are  found  to  be  generalizable,  then  the  number  of 
conditions  sampled  can  be  reduced,  with  minimal  losses  in  generalizability.  In  fact, 
generalizability  theory  can  forecast  the  resulting  dependability  indices  for  differing  sets  of 
measurement  conditions.  Thus,  it  can  provide  decision-makers  with  answers  to  questions 
such  as:  How  dependable  would  our  rating  system  be  if  we  used  only  supervisory  ratings 
on  a  single  form?  Similarly,  another  relevant  question  would  be  how  the  measurement 
process could be improved by increasing the number of items on each form. Generalizability
theory can forecast the effects of these modifications on dependability as well. Thus,
generalizability theory appears to have considerable relevance to the performance
measurement  project.  These  issues  are  discussed  and  illustrated  in  this  paper. 

Several  individuals  warrant  acknowledgement  for  their  assistance  in  the  preparation  of 
this  paper.  Both  Lt  Colonel  Rodger  Ballentine  and  Dr  Jerry  Hedge  have  been  extremely 
helpful  during  my  association  with  the  Laboratory.  They  have  helped  pull  the  strings  that 
were  needed  on  various  occasions  during  the  project.  Dr  Terry  Dickinson  provided 
numerous technical comments and useful suggestions. Finally, Mark Teachout has excelled
as  a  contract  monitor,  pulling  the  statement  of  work  together,  providing  data  or  information 
upon  request,  reviewing  drafts,  and  keeping  the  project  close  to  schedule. 

GENERALIZABILITY THEORY: AN ASSESSMENT OF ITS
RELEVANCE TO THE AIR FORCE
JOB PERFORMANCE MEASUREMENT PROJECT


INTRODUCTION 

Generalizability theory was developed by Cronbach and associates (Cronbach, Gleser, Nanda, &
Rajaratnam, 1972; Cronbach, Rajaratnam, & Gleser, 1963) as an alternative to classical test theory. Whereas
classical  test  theory  posited  a  single  true  score  and  a  single  undifferentiated  error  term,  Cronbach  and 
associates  recognized  that  measurement  error  may  be  introduced  by  any  number  of  sources,  and  advocated 
multifaceted  experiments  to  estimate  the  error  variance  due  to  each  source.  Further,  they  replaced  the 
concept  of  a  single  true  score  with  a  universe  score,  the  value  of  which  could  vary  by  the  nature  of  the  error 
source  and  the  population  of  generalization.  Thus,  generalizability  theory  offers  a  more  complex,  yet  more 
realistic  portrayal  of  measurement  error.  Generalizability  theory  contributes  to  measurement  theory  in  other 
ways  as  well,  by  making  certain  measurement  assumptions  explicit.  For  example,  the  researcher  must 
specify  whether  the  test  is  to  be  used  for  absolute  (criterion-referenced)  or  relative  (norm-referenced) 
decision-making,  as  the  degree  of  error  will  vary  by  purpose.  The  many  contributions  of  generalizability 
theory  to  measurement  theory  will  be  explained  below.  First,  the  theory  itself  must  be  explained  in  greater 
detail.  To  do  so,  I  will  first  discuss  some  ambiguities  inherent  in  classical  test  theory. 

Classical  test  theory  proposes  that  any  observed  score  for  a  person  on  an  instrument  can  be 
decomposed  into  two  parts:  true  score  and  error.  Since  these  two  components  are  assumed  to  be 
independent, it follows that observed score variance ($\sigma_x^2$) can also be decomposed into true score variance
($\sigma_t^2$) and error variance ($\sigma_e^2$). The reliability of an instrument is then defined as the proportion of observed
score variance which is true score variance ($\sigma_t^2 / \sigma_x^2$). In other words, reliability represents freedom from
measurement error ($\sigma_t^2 / \sigma_x^2 = \sigma_x^2 / \sigma_x^2 - \sigma_e^2 / \sigma_x^2$).

In  practice,  researchers  have  operationalized  reliability  in  a  number  of  ways.  Each  strategy  identifies  a 
different source of measurement error and estimates a different "type" of reliability. When a test is
administered on two occasions, the correlation between those two sets of scores is called test-retest reliability
and it estimates error due to variance in measurement conditions over time. The correlation between scores
assigned by two raters or judges is called "conspect reliability," and it estimates error variance introduced
by different scorers. Though classical test theory provides different descriptors of reliability, it still uses only
a  single  term  for  each  type  of  error  implicit  in  each  form  of  reliability.  Relationships  among  different  kinds  of 
measurement  error  are  unclear  and  (more  critically)  inestimable.  Classical  test  theory  leaves  us  with  a 
fundamental  paradox  of  a  single  true  score  but  multiple  estimates  of  true  score  variance  depending  on  how 
error  variance  is  defined.  Furthermore,  since  the  theory  is  univariate  by  nature,  it  does  not  easily  allow  for 
the  estimation  of  the  joint  effects  of  multiple  sources  of  error  variance.  Thus,  while  the  Spearman-Brown 
prophecy  formula  allows  us  to  predict  the  resulting  reliability  of  a  test  after  the  number  of  items  on  the  test 
is  increased  or  decreased,  it  does  not  allow  us  to  predict  the  increase  in  reliability  from  an  increase  in  both 
the  length  of  the  test  and  the  number  of  times  it  was  administered. 
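
The Spearman-Brown relationship mentioned above can be sketched in a few lines of Python. This is only an illustration and not part of the original analyses: the function name `spearman_brown` and the starting reliability of .60 are assumptions chosen for demonstration. The formula takes a single length factor, which is why it can address only one source of error at a time.

```python
def spearman_brown(reliability, length_factor):
    """Predict reliability after multiplying test length by length_factor."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical test with reliability .60, doubled and then halved in length.
doubled = spearman_brown(0.60, 2.0)
halved = spearman_brown(0.60, 0.5)
print(round(doubled, 3), round(halved, 3))  # 0.75 0.429
```

Nothing in the function describes what happens when both the number of items and the number of administrations change together, which is the limitation noted in the text.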


Multifaceted  Approach  of  Generalizability  Theory 

In  contrast  to  classical  test  theory,  generalizability  theory  explicitly  recognizes  the  existence  of  multiple 
sources  of  error  variance  and  provides  methods  for  simultaneously  estimating  each.  It  encourages  the 
researcher  to  explicitly  consider  conditions  which  may  affect  the  measurement  process.  For  example,  the 
ratings  one  individual  receives  from  a  group  of  peers  may  depend  upon  which  particular  peers  were  selected 
as  raters,  which  dimensions  were  chosen  as  stimuli,  the  time  of  day  each  peer  observed  the  ratee,  or  the 
difficulty  of  the  tasks  performed  by  the  ratee  at  the  time  of  observation.  To  say  that  ratings  depend  on  these 
conditions  is  to  say  that  the  ratee’s  expected  score  could  change  if  different  elements  of  each  factor  were 
sampled. 

In generalizability theory, the researcher identifies the factors affecting measurement which are of the
greatest interest or importance. Then, the researcher specifies a particular range of levels of each factor for
study. In G theory terminology, factors of measurement are called "facets" and levels of the facet are called
"conditions." Thus, in a study of the generalizability of ratings of Walk-Through Performance, two important
facets might be the actual tasks performed and the test administrators. That is, the score any ratee receives
could depend on the difficulty of the tasks which comprise the Walk-Through Test or on any idiosyncrasies
of individual administrators/raters.

A  generalizability  (G)  study  could  be  designed  to  estimate  the  contribution  of  the  task  and  administrator 
facets  to  total  score  variance.  Ideally,  random  samples  of  tasks  and  administrators  (drawn  from  the 
population  of  all  possible  tasks  and  administrators)  would  be  identified  and  tested  on  a  large  sample  of 
individuals.  The  populations  of  all  possible  tasks  and  administrators  define  the  universe  of  admissible 
observations,  the  boundaries  of  which  must  be  explicitly  defined  by  the  researcher.  That  is,  the  researcher 
would  specify  what  constitutes  an  acceptable  task  or  unacceptable  task  for  any  given  administration  of  the 
Walk-Through. 

The  G  study  could  be  designed  as  fully  crossed  or  nested.  In  a  fully  crossed  design,  all  conditions  of 
one  facet  are  observed  in  combination  with  all  conditions  of  the  other  facet;  i.e.,  all  administrators  rate  all 
tasks.  With  a  nested  design,  different  conditions  of  one  facet  are  nested  in,  or  observed  within,  different 
conditions  of  the  other  facet.  For  example,  if  tasks  were  nested  within  administrators,  each  administrator 
would observe and rate a different set of tasks. In general, fully crossed designs are preferable because they
allow for direct estimation of all possible variance components, but nested designs are often used either
because they are more efficient or because they reflect real-world situations. (For example, in most G studies
of teacher evaluations, students are nested within classes.)

For  the  present  example,  let  us  assume  that  we  have  designed  a  fully  crossed  G  study  in  which  all 
individuals  are  examined  on  10  different  tasks  by  four  different  administrators.  This  design  is  illustrated  in 
Figure 1. The cube in Figure 1 represents all cells in the design, or all possible combinations of q persons
observed on 10 tasks by four administrators. Note that in contrast to traditional analysis of variance (ANOVA)
designs,  persons  (or  subjects)  are  treated  as  a  factor  with  only  a  single  replicate  in  each  cell. 

Figure 1. Sample Generalizability Design.

With a set of persons observed in a two-facet fully crossed design, there are seven possible sources of
variance. These sources are illustrated by the Venn diagram in Figure 2. In ANOVA terms, there are three
"main effects" represented graphically by the variance components for tasks ($\sigma_t^2$), administrators ($\sigma_a^2$), and
ratees or persons ($\sigma_p^2$). $\sigma_t^2$ and $\sigma_a^2$ represent sources of systematic error variance while $\sigma_p^2$ represents
(desirable) variance due to individual differences. In addition, there are three two-way interaction terms: $\sigma_{pt}^2$,
$\sigma_{pa}^2$, and $\sigma_{ta}^2$. These interactions may be interpreted as follows. Variance due to the task-by-persons
interaction ($\sigma_{pt}^2$) might be high if persons performed very well on some tasks but very poorly on others. $\sigma_{ta}^2$
would indicate error due to differential scoring of tasks by administrators while $\sigma_{pa}^2$ would indicate whether
persons were differentially ranked by different administrators. The final area of the Venn diagram represents
the confounding of the three-way interaction of tasks, administrators, and persons with random error variance
($\sigma_{pta}^2 + \sigma_e^2$). These two sources (variance attributable to the three-way interaction and error variance) cannot
be  separated  since  there  is  only  a  single  observation  per  "cell"  in  the  design.  It  is  important  to  note  that  some 
of  what  is  called  random  error  variance  in  this  design  could  be  explained  as  systematic  variance  in  a  more 
complex  design  with  additional  facets.  Ideally,  enough  facets  could  be  identified  and  measured  to  explain 
all  error  variance. 

The multifaceted treatment of error variance by generalizability theory should be clearer at this point.
For any set of persons, the total observed score variance is represented by the lower full circle in Figure 2.
Generalizability theory partitions that variance into individual difference variance ($\sigma_p^2$) and multiple, distinct
sources of error variance (e.g., $\sigma_{pt}^2$, $\sigma_{pa}^2$, and $\sigma_{ta}^2$). Just as classical test theory provides a reliability
coefficient conceptually equal to $\sigma_t^2 / \sigma_x^2$, generalizability theory provides a generalizability coefficient, $E\rho^2$,
which is equal to $\sigma_p^2 / \sigma_x^2$, or $\sigma_p^2 / (\sigma_p^2 + \sigma_{pt}^2 + \sigma_{pa}^2 + \sigma_{pta,e}^2)$ in the present example, where $\sigma_{pta,e}^2$ is the
residual (the three-way interaction confounded with random error). However, generalizability
coefficients  computed  from  G  study  estimates  are  generally  not  as  useful  as  those  generated  from  a  different 
type  of  study  called  a  decision  (D)  study,  which  is  described  below.  Also,  many  researchers  have  stressed 
that  interpretation  of  the  variance  components  themselves  is  at  least  as  important  as  the  interpretation  of  the 
generalizability  coefficient  (e.g.,  Brennan  &  Kane,  1979). 

To return again to the example, we have developed a fully crossed two-facet design which allows
estimation  of  seven  distinct  sources  of  variance.  After  the  data  are  collected,  we  must  compute  the  actual 
variance  components  for  each  source.  Variance  components  are  determined  from  the  mean  squares  of  a 
traditional  ANOVA.  In  this  analysis,  persons  are  treated  as  a  between-subjects  factor  (with  one  observation 
per  cell)  and  the  multiple  conditions  of  each  facet  are  treated  as  repeated  measures  on  a  facet.  The  precise 
formulas  for  determining  variance  component  estimates  from  ANOVA  mean  squares  are  often  complex,  but 
may  be  derived  from  algorithms  available  in  Brennan  (1983),  Brennan  and  Kane  (1979),  or  Cronbach  et  al. 
(1972). 
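
As a rough sketch of how variance components follow from ANOVA mean squares, the Python function below handles only the fully crossed persons x tasks x administrators design with one observation per cell. It relies on the standard expected-mean-square equations for a random-effects model. The function name `variance_components`, the nested-list data layout, and the small artificial data set are assumptions made for illustration; the sources cited above should be consulted for nested or unbalanced designs.

```python
from itertools import product
from statistics import fmean


def variance_components(scores):
    """Estimate G study variance components for a crossed p x t x a design.

    scores[p][t][a] holds the single observation for person p on task t
    as rated by administrator a. Each facet needs at least two conditions.
    """
    n_p, n_t, n_a = len(scores), len(scores[0]), len(scores[0][0])
    P, T, A = range(n_p), range(n_t), range(n_a)

    # Grand mean and marginal means for each main effect and two-way table.
    grand = fmean(scores[p][t][a] for p, t, a in product(P, T, A))
    m_p = [fmean(scores[p][t][a] for t, a in product(T, A)) for p in P]
    m_t = [fmean(scores[p][t][a] for p, a in product(P, A)) for t in T]
    m_a = [fmean(scores[p][t][a] for p, t in product(P, T)) for a in A]
    m_pt = [[fmean(scores[p][t]) for t in T] for p in P]
    m_pa = [[fmean(scores[p][t][a] for t in T) for a in A] for p in P]
    m_ta = [[fmean(scores[p][t][a] for p in P) for a in A] for t in T]

    # Sums of squares; the residual (pta,e) is obtained by subtraction.
    ss = {
        "p": n_t * n_a * sum((m - grand) ** 2 for m in m_p),
        "t": n_p * n_a * sum((m - grand) ** 2 for m in m_t),
        "a": n_p * n_t * sum((m - grand) ** 2 for m in m_a),
        "pt": n_a * sum((m_pt[p][t] - m_p[p] - m_t[t] + grand) ** 2
                        for p, t in product(P, T)),
        "pa": n_t * sum((m_pa[p][a] - m_p[p] - m_a[a] + grand) ** 2
                        for p, a in product(P, A)),
        "ta": n_p * sum((m_ta[t][a] - m_t[t] - m_a[a] + grand) ** 2
                        for t, a in product(T, A)),
    }
    total = sum((scores[p][t][a] - grand) ** 2 for p, t, a in product(P, T, A))
    ss["pta,e"] = total - sum(ss.values())

    df = {
        "p": n_p - 1, "t": n_t - 1, "a": n_a - 1,
        "pt": (n_p - 1) * (n_t - 1),
        "pa": (n_p - 1) * (n_a - 1),
        "ta": (n_t - 1) * (n_a - 1),
        "pta,e": (n_p - 1) * (n_t - 1) * (n_a - 1),
    }
    ms = {effect: ss[effect] / df[effect] for effect in ss}
    res = ms["pta,e"]

    # Solve the expected-mean-square equations for each component.
    return {
        "p": (ms["p"] - ms["pt"] - ms["pa"] + res) / (n_t * n_a),
        "t": (ms["t"] - ms["pt"] - ms["ta"] + res) / (n_p * n_a),
        "a": (ms["a"] - ms["pa"] - ms["ta"] + res) / (n_p * n_t),
        "pt": (ms["pt"] - res) / n_a,
        "pa": (ms["pa"] - res) / n_t,
        "ta": (ms["ta"] - res) / n_p,
        "pta,e": res,
    }


# Artificial, purely additive data: 5 persons x 2 tasks x 3 administrators.
scores = [[[person + task for _ in range(3)] for task in (0.0, 2.0)]
          for person in (1.0, 2.0, 3.0, 4.0, 5.0)]
estimates = variance_components(scores)
print({effect: round(value, 3) for effect, value in estimates.items()})
```

Because the artificial scores contain only person and task effects, the person and task components come out as 2.5 and 2.0 and every other component is zero. With real data, estimates for small components can be slightly negative; how to treat such values is left to the analyst.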

The  variance  components  computed  from  a  G  study  represent  estimated  variance  about  universe  scores 
for  average,  or  single,  observations;  e.g.,  a  single  person  evaluated  on  a  single  task  by  a  single  administrator. 
As  noted  above,  these  variance  components  can  be  used  to  compare  the  relative  contributions  of  error 
variance sources or to compute generalizability coefficients. The results of such analyses may be misleading,
however,  as  the  G  study  design  may  differ  from  the  way  organizations  typically  use  measurement 
instruments.  That  is,  G  study  estimates  are  for  single  items  or  administrations,  yet  organizations  typically 
strive  for  multiple  measures  of  a  construct  (or  at  least  multiple-item  scales)  to  minimize  error  variance  relative 
to  universe  score  variance.  Thus,  it  is  necessary  to  distinguish  unitary  estimates  of  G  studies  from  the 
estimates  of  decision  (D)  studies  which  better  reflect  how  an  organization  uses  a  measurement  instrument. 


Generalizability Analyses for Decision-Making

A  G  study  establishes  the  general  characteristics  of  a  measuring  device;  in  particular,  the  relative  effects 
of different sources of variance. It is quite possible, though, that the measurement instrument will be used in a
manner different from the way it was used to estimate the G study variance components. For example, the Air Force
may  elect  to  decrease  the  number  of  tasks  used  on  the  Walk-Through  Performance  test  from  the  number 
sampled  for  the  G  study.  A  D  study  should  be  conducted  to  assess  the  specific  characteristics  of  a 
measurement  instrument  in  a  particular  decision-making  context.  D  studies  may  involve  a  different  sample 
with  a  unique  sampling  of  facets;  oftentimes  though,  such  a  sample  is  not  available  and  D  study  data  are 
simulated  from  G  study  data.  In  either  case,  the  D  study  is  defined  by  how  the  organization  intends  to  use 
the measurement instrument. As noted by Gillmore (1979, 1983), two critical specifications in nearly any D
study  are  (a)  the  universe  of  generalization,  and  (b)  the  number  and  type  of  conditions  to  be  sampled  for 
each  facet.  These  are  explained  in  greater  detail  below. 

The  concept  of  the  universe  of  generalization  is  closely  related  to  the  concept  of  the  universe  of  admissible 
observations.  The  universe  of  admissible  observations  refers  to  the  facets  that  one  decides  to  include  in  a 
G  study  and  to  the  range  of  conditions  which  can  be  sampled  from  each.  The  universe  of  generalization  in 
a  corresponding  D  study  can  be  no  larger  than  the  universe  of  admissible  observations;  that  is,  it  cannot 
contain  facets  missing  in  the  universe  of  admissible  observations,  nor  can  it  contain  a  broader  range  of 
conditions.  The  universe  of  generalization  may  be  smaller,  however.  For  example,  if  the  universe  of 
admissible  observations  included  all  tasks  performed  by  jet  engine  mechanics,  a  smaller  subset  of  tasks  - 
such  as  all  cognitive  tasks  or  all  tasks  using  a  particular  tool  -  may  be  specified  for  purposes  of 
decision-making. 

The  more  important  consideration  in  defining  a  universe  of  generalization  is  deciding  whether  each  facet 
is  to  be  considered  fixed  or  random.  Random  facets  imply  that  the  conditions  of  a  facet  in  the  D  study 
represent  a  random  sample  from  an  essentially  larger  set  of  possible  admissions.  In  practice,  it  is  probably 
not  necessary  to  actually  use  full  random  sampling  procedures.  However,  one  must  at  least  be  willing  to 
assume that the conditions sampled in the D or G study could be replaced with other elements of some
larger  set  of  possible  observations  without  affecting  the  universe  score  (Shavelson  &  Webb,  1981).  For 
example,  it  is  reasonable  to  assume  that  the  actual  administrators  on  any  particular  administration  of  the 
Walk-Through  Performance  Test  constitute  a  random  sample  of  all  possible  administrators,  since  other 
observers  could  be  trained  to  replace  them.  When  a  random  facet  is  specified,  generalization  is  not  limited 
to the set of D study conditions but instead extends to the entire range of admissible observations.

A second possibility is that the conditions of a facet in a D study exhaust the range of conditions of
interest and that generalization is intended only within that particular range. In this instance, the facet is
considered fixed, and generalization is limited to the range of conditions included in the D study. Fixed
effects may be treated in either of two ways. First, separate variance components for the other facets may
be computed within each level of the fixed facet. For example, consider a study in which measurement
occasions and rating sources (self and supervisor) were the facets of generalization. If the organization were
going to use the ratings of different sources for different purposes, separate estimates of generalizability
would be calculated for each source. In other instances, the organization might calculate a single summary
score over all conditions of the fixed facet. For example, items on a selection test might be fixed, and each
applicant  receives  a  single  total  or  average  score.  Here,  the  generalizability  of  the  summary  score  is  of 
interest.  Since  individual  conditions  (items)  no  longer  matter,  there  can  be  no  errors  due  to  sampling  of 
person-item  interactions.  Thus,  this  variance  component  would  contribute  to  universe  score  variance  and 
not  to  error  variance.  This  would  have  the  effect  of  increasing  the  generalizability  of  the  test,  though  the 
improvement  would  hold  only  for  the  fixed  set  of  conditions. 
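
The effect of fixing a facet on a summary score can be illustrated with a small Python sketch. It assumes a persons x items x occasions design used for rank-ordering persons, and it assumes the common averaging treatment in which the person-by-item component, divided by the number of items, is moved from the error term into universe score variance. The function names and all numerical values are hypothetical.

```python
def g_coefficient_items_random(vc, n_i, n_o):
    """Coefficient when both items and occasions are treated as random."""
    error = vc["pi"] / n_i + vc["po"] / n_o + vc["pio,e"] / (n_i * n_o)
    return vc["p"] / (vc["p"] + error)


def g_coefficient_items_fixed(vc, n_i, n_o):
    """Coefficient for the average over a fixed set of n_i items."""
    # The person-by-item component now counts toward universe score variance.
    universe = vc["p"] + vc["pi"] / n_i
    error = vc["po"] / n_o + vc["pio,e"] / (n_i * n_o)
    return universe / (universe + error)


# Hypothetical single-observation variance components.
vc = {"p": 0.30, "pi": 0.20, "po": 0.05, "pio,e": 0.32}
print(round(g_coefficient_items_random(vc, n_i=10, n_o=2), 3))
print(round(g_coefficient_items_fixed(vc, n_i=10, n_o=2), 3))
```

The second value is the larger of the two, but it describes only the particular 10 items in hand, which is the restriction noted in the text.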

It  should  be  noted  that  considerations  of  fixed  and  random  facets  occur  at  the  D  study  level,  not  at  the 
G  study  level.  When  computing  G  study  estimates  of  variance  components,  all  facets  are  treated  as  random 
(i.e., all are estimated). In subsequent D studies, the variance component for a fixed facet is set to zero if the
generalizability of average scores over that facet is of interest. Shavelson (1986) recommended inspection
of  the  G  study  variance  components  for  the  fixed  facet  as  a  means  of  deciding  how  to  treat  the  facet  for  D 
study  analyses.  If  the  variance  component  for  the  fixed  facet  is  large,  it  may  be  inappropriate  to  average 
over  its  conditions  and  Shavelson  recommends  separate  D  study  analyses  for  other  facets  within  each  level 
of  the  fixed  facet. 

Another  specification  in  the  D  study  is  the  number  of  conditions  for  each  facet.  These  are  not  restricted 
to  the  number  of  conditions  sampled  in  the  G  study.  Instead,  the  investigator  can  systematically  vary  the 
number  of  conditions  in  the  D  study  to  forecast  the  resulting  generalizability.  G  study  variance  component 
estimates  are  actually  average  effects  for  single  occurrences  of  each  facet.  That  is,  a  G  study  variance 
component  represents  measurement  error  when  only  a  single  level  of  the  facet  is  used.  However, 
measurement  error  decreases  as  scores  for  the  object  of  interest  are  averaged  over  multiple  levels  of  a  facet. 
Analogous  to  the  Spearman-Brown  prophecy  formula  of  classical  test  theory,  D  study  estimates  of  variance 
components  are  computed  to  estimate  the  actual  degree  of  error  variance  under  varying  numbers  of  levels 
of each facet. Since the variance of a mean (score over levels of the facet) is equal to the variance of a single
observation divided by the number of observations averaged, D study estimates are computed by dividing the G study
variance component estimates by the number of conditions specified in the D study. Logically and empirically, the error variance attributed to any
one  source  approaches  zero  as  the  number  of  conditions  grows  infinitely  large.  Computing  different  D  study 
estimates  for  differing  numbers  of  conditions  allows  the  researcher  to  predict  how  dependable  a  measure 
would  be  under  a  variety  of  measurement  conditions. 
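
The division described above can be sketched in Python. The sketch assumes the persons x tasks x administrators example used earlier, with component labels such as "pt" naming the effects involved; the function name `d_study_components` and the G study values are hypothetical and serve only to show the arithmetic.

```python
def d_study_components(g_components, n_conditions):
    """Scale G study components to a D study with the given facet sample sizes.

    Each component is divided by the number of D study conditions of every
    facet named in its label; the person component "p" is left unchanged.
    """
    d_components = {}
    for label, value in g_components.items():
        divisor = 1
        for facet in label.split(",")[0]:
            divisor *= n_conditions.get(facet, 1)
        d_components[label] = value / divisor
    return d_components


# Hypothetical G study estimates (one person, one task, one administrator).
g_study = {"p": 0.30, "t": 0.10, "a": 0.02,
           "pt": 0.20, "pa": 0.05, "ta": 0.01, "pta,e": 0.32}

# Forecast the person-by-task and residual components as tasks are added,
# holding the number of administrators at four.
for n_t in (1, 5, 10, 20):
    d = d_study_components(g_study, {"t": n_t, "a": 4})
    print(n_t, round(d["pt"], 4), round(d["pta,e"], 4))
```

Each error component shrinks toward zero as the number of conditions grows, while the person component is unaffected.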

Finally,  it  should  also  be  noted  that  Cronbach  et  al.  (1972)  recognized  that  decision-makers  use  tests 
for  different  purposes;  thus,  they  specified  different  error  terms  for  each  purpose.  In  most  cases,  tests  are 
used  for  either  relative  or  absolute  decisions.  For  a  relative  decision,  the  test  is  used  only  to  rank-order 
persons.  Two  common  examples  would  be  a  predictor  used  with  a  top-down  hiring  process  or  a  criterion 
which is to be correlated with a predictor. In this case, errors in persons' actual scores do not matter, as
long as these errors are equivalent for each person in the sample. Problems occur only when errors exist
for some persons but not for others. Thus, if a test consisted of a particularly difficult sample of items, variance
due to items would not be considered error since test difficulty alone would lower all scores but preserve
the  rank  order.  However,  the  person-by-item  variance  component  would  be  considered  error  since  this 
interaction  means  that  some  persons  are  affected  more  by  item  difficulty  than  are  others.  In  general,  the 
error  term  for  relative  decisions  includes  all  variance  components  which  represent  an  interaction  of  a  facet 
with  persons. 

Absolute decisions are those made about an actual score. A common example is a driving test in which
the applicant must attain a certain score to receive a license. For such decisions, all effects which influence the
level of the score are included in the error term. Thus, variance due to items would be included since a
particularly easy or difficult sample of items would affect a person's likelihood of passing. In general, the
error term for absolute decisions includes all variance components other than that which constitutes the
universe  score  variance  (typically  persons).  Typically,  specifications  of  relative  or  absolute  error  terms  are 
made  at  the  D  study  level. 
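
The two error terms can be contrasted in a short Python sketch, again assuming the persons x tasks x administrators example with persons as the object of measurement. The function names and variance component values are hypothetical. The coefficient for absolute decisions is often called an index of dependability in the G theory literature, a label that does not appear in the discussion above.

```python
def error_variances(vc, n_t, n_a):
    """Return (relative, absolute) error variance for a D study."""
    # Relative error: only components that involve an interaction with persons.
    relative = vc["pt"] / n_t + vc["pa"] / n_a + vc["pta,e"] / (n_t * n_a)
    # Absolute error: every component except the person component.
    absolute = relative + vc["t"] / n_t + vc["a"] / n_a + vc["ta"] / (n_t * n_a)
    return relative, absolute


def dependability(vc, n_t, n_a):
    """Coefficients for relative and absolute decisions."""
    relative, absolute = error_variances(vc, n_t, n_a)
    return {
        "relative": vc["p"] / (vc["p"] + relative),
        "absolute": vc["p"] / (vc["p"] + absolute),
    }


# Hypothetical G study estimates (one person, one task, one administrator).
vc = {"p": 0.30, "t": 0.10, "a": 0.02,
      "pt": 0.20, "pa": 0.05, "ta": 0.01, "pta,e": 0.32}

for n_t, n_a in ((1, 1), (10, 1), (10, 4)):
    result = dependability(vc, n_t, n_a)
    print(n_t, n_a, round(result["relative"], 3), round(result["absolute"], 3))
```

The absolute coefficient can never exceed the relative one, because its error term contains every component of the relative error term plus the task, administrator, and task-by-administrator components.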

The  discussion  to  this  point  may  be  summarized  as  follows.  An  investigator  will  conduct  a  single 
large-scale  G  study  to  estimate  variance  components  for  each  effect  in  a  model.  From  this  single  set  of 
variance component estimates, the researcher can generate (simulate) numerous sets of D study variance
component estimates and generalizability coefficients, depending upon how the measuring instrument is to
be used. It is these D study results which are of the most interest to decision-makers since D study results
reflect realistic or intended measurement conditions. Interpreting G and D study results is discussed further
in the next section on applications of generalizability theory.


Applications of Generalizability Theory

There have been relatively few applications of generalizability theory in the field of
industrial/organizational (I/O) psychology. What little has been done has been in the area of job analysis
and job evaluation. Two studies (Webb & Shavelson, 1981; Webb, Shavelson, Shea, & Morello, 1981) have
explored the generalizability of General Educational Development (GED) ratings by job analysts. Three other
studies have applied generalizability theory to various job evaluation instruments (Doverspike & Barrett, 1984;
Doverspike, Carlisi, Barrett, & Alexander, 1983; Fraser, Cronshaw, & Alexander, 1984). Two representative
studies  are  described  below. 

First, Webb et al. (1981) investigated the generalizability of GED ratings of reasoning development,
mathematics  development,  and  language  development  by  experienced  job  analysts.  These  GED  ratings  are 
subjective  evaluations  of  the  level  of  cognitive  skill  necessary  to  perform  various  jobs,  and  are  frequently 
used  to  estimate  training  requirements  or  refer  persons  to  job  training  programs.  Based  on  in-house  job 
descriptions,  71  analysts  rated  27  jobs  in  terms  of  GED  requirements.  The  raters  were  nested  within  one  of 
11 different field centers, and they rated the jobs on two different occasions. Thus, the investigators employed
a three-facet design, with raters nested within centers and crossed with jobs and occasions. (Note that jobs
are  the  object  of  measurement  and  are  not  considered  a  facet.)  Separate  analyses  were  performed  for  each 
GED  rating  scale. 

The results were favorable regarding the generalizability of GED ratings. For the design
described above, computed generalizability coefficients ($E\rho^2$) of ratings from an average rater at an average
center on one occasion ranged from .53 to .67 for the three scales. However, D study data showed that the
generalizability coefficients ranged from .79 to .85 for the mean of four raters. Inspection of the variance
components  showed  that  differences  in  jobs  accounted  for  the  largest  variance  in  ratings.  The  largest 
sources  of  undesirable  error  variance  were  jobs  crossed  with  raters  within  centers,  and  the  residual  error 
term.  The  latter  component  includes  random  or  undifferentiated  error  variance  while  the  former  component 
represents  idiosyncratic  perceptions  of  certain  jobs  by  raters  at  certain  centers.  That  is,  error  of  this  nature 
would result if raters at one center perceived the GED requirements of a job (such as "central-supply worker")
differently  than  did  raters  at  other  centers. 

In a second study, Fraser et al. (1984) used generalizability theory to determine the reliability of job
evaluation  ratings.  Ratings  were  made  by  three  analysts  on  12  different  jobs,  using  eight  different  job 
evaluation  scales  (e.g.,  required  education,  contact  with  others,  and  working  conditions).  Ratings  were  again 
based  on  in-house  job  descriptions.  Jobs  were  the  object  of  measurement,  and  the  design  was  fully 
crossed. 

The data were first analyzed with raters and scales as the facets of generalization. However, there was
considerable variation due to evaluation scales so that it seemed inappropriate to assume that the different
scales were simply replicates of the same facet. Instead, separate analyses were conducted for each of the
eight  scales.  Jobs  were  still  the  object  of  measurement,  but  raters  were  the  only  facet  of  generalization. 

For most scales, jobs accounted for the most variance in ratings. This is desirable since jobs were the
object of measurement. There was little or no variance due to raters on most scales, but there was
considerable variance due to the rater-by-job interaction. In other words, the results showed that jobs were
differentially ranked by analysts on most scales. Still, the generalizability of job evaluation ratings was fairly
high. For seven of the scales, the generalizability coefficients ranged from .70 to .92 for one random rater
(G study results) and from .88 to .92 for the average of three random raters (D study results). Fraser et al.
(1984) also compared their results to those of Doverspike et al. (1983), who used graduate students as raters.
Interestingly, while the generalizability coefficients (and, hence, the relative size of variance components)
were similar across studies, the variance components were considerably larger in the Fraser et al. field study.
Presumably, this is because actual analysts used more extreme ratings to maximize discriminability among
jobs.

To  date,  there  have  been  no  published  applications  of  generalizability  theory  to  performance  appraisals 
in  standard  organizational  settings.  However,  examples  abound  in  the  clinical  psychology  and  educational 
measurement  domains.  For  example,  the  generalizability  of  behavior  observations  or  clinical  assessments 
has  been  studied  by  Edinberg,  Karoly,  and  Gleser  (1977);  Farrell,  Mariotto,  Conger,  Curran,  and  Wallander 
(1979);  Gleser,  Green,  and  Winget  (1978);  Littlefield,  Murrey,  and  Garman  (1977);  Mariotto  and  Farrell  (1979); 
and Wieder and Weiss (1980). Applications to student evaluations of instructors are too numerous to list
completely, but several representative studies include Carbno (1981); Gillmore, Kane, and Naccarato (1978);
Hopkins (1983); and Kane, Gillmore, and Crooks (1976). One clinical assessment example is applicable and
is  described  in  greater  detail  below. 

Littlefield et al. (1977) examined the generalizability of faculty ratings of third- and fourth-year dental
students. Ratings were made on five general dimensions of noncognitive skills. They were collected from
31 faculty members on 12 students during one phase and from 16 faculty on 5 students during another
phase. Each phase was considered a separate D study since it reflected a different set of measurement
conditions (differing numbers of raters). Students were the object of measurement and the facets of
generalization were raters (faculty) and rating scales. Separate D study analyses were also conducted with
the scales treated as fixed or random (i.e., a mean scale score was used). The generalizability of ratings
across both raters and/or scales was quite high. The generalizability coefficients were .92 and .83 for the
two phases of ratings. When scales were considered fixed and the generalizability of the mean score across
raters was computed, the generalizability coefficients increased to .95 and .86. In other words, about 90%
of the variance in scores could be attributed to individual differences in ratees (universe score variance).
Simulated D study results were also computed for a more realistic organizational condition: ratings obtained
from only one or two raters. As would be expected, generalizability coefficients were considerably lower,
ranging from .53 to .61 for one rater and from .68 to .76 for two raters. Littlefield et al. concluded that at least
two raters were necessary for dependable ratings.
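
The gain from adding a second rater can be approximated with the classical step-up (Spearman-Brown) relationship. This is what a D study reduces to when rater-linked error is the only error being averaged. The sketch below illustrates that simplifying assumption; it is not the computation Littlefield et al. reported, and the function name is hypothetical.

```python
def coefficient_for_n_raters(single_rater_coef: float, n_raters: int) -> float:
    """Project a one-rater generalizability coefficient to the mean of n raters.

    Assumption: all relative error is rater-linked, so error variance shrinks
    as 1/n_raters (the Spearman-Brown relationship).
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return n_raters * single_rater_coef / (1 + (n_raters - 1) * single_rater_coef)


# One-rater coefficients of .53 and .61, projected to the mean of two raters
low = coefficient_for_n_raters(0.53, 2)
high = coefficient_for_n_raters(0.61, 2)
print(round(low, 2), round(high, 2))
```

Under this assumption the projected two-rater values are about .69 and .76, close to the reported range of .68 to .76.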


Summary  and  Critique 


Generalizability  theory  is  a  versatile  and  efficient  way  of  estimating  the  effects  of  measurement  conditions 
on  tests  and  instruments.  With  a  single  design,  the  researcher  is  able  to  simultaneously  estimate  error 
variance  due  to  facets  such  as  items,  test  occasions,  and  raters.  Notably,  their  joint  effects  can  be  assessed 
as well. In contrast, classical test theory enables assessment of measurement error from only one source
at  a  time.  Clearly,  generalizability  theory  offers  a  more  realistic  portrayal  of  measurement  error  than  does 
classical  test  theory. 

Another primary contribution of generalizability theory is that it forces the researcher to address
measurement issues that are often ignored. For example, the researcher must clearly understand the
universe of admissible observations to which he or she wishes to generalize. In doing so, the researcher
must explicate conditions of measurement which may affect any one individual's score. These conditions
define the universe of admissible observations, place limits on the instrument's generalizability, and alert the
researcher to sources of error in measurement. A second example is the D study specification of whether
the instrument is to be used for relative or absolute decision-making. The degree of error differs by purpose.
It is well accepted that particular tests may be dependable for grouping cases or serving as a criterion, but
not for differentiating among individuals; however, there has been no acceptable way before generalizability
theory to express these differences.

Finally,  generalizability  theory  is  valuable  because  it  re-emphasizes  traditional  notions  of  reliability. 
Although  it  offers  a  more  complex  and  meaningful  analysis  of  reliability,  generalizability  theory  is  nonetheless 
based  on  the  assumption  that  to  be  useful  a  measure  must  be  consistent  or  replicable  over  applications.  In 
the field of industrial/organizational psychology in general, and performance appraisal in particular, we often
lose sight of the importance of reliability and consistency. Instead, issues such as rating errors, construct
validity, and accuracy dominate conceptual and empirical work. Reliability is so often overlooked that few
validation studies report coefficients for either predictors or criteria (e.g., Pearlman, Schmidt, & Hunter, 1980).
We all know that a measure must be reliable to be valid; however, we tend to forget it. Generalizability theory
refocuses  our  attention  on  reliability  and  offers  promise  for  improving  the  reliability  of  measures  by  providing 
the  means  for  pinpointing  sources  of  error. 


Analysis  Plan 

Because generalizability theory has much to offer to researchers and psychometricians, a
demonstration study will be presented. A broad G study will be presented along with several sets of D study
results. The study examines the generalizability of performance ratings collected as part of the Air Force
performance measurement project. In particular, it identifies three primary facets which could affect scores
of individuals: rating forms, items or dimensions on forms, and rating sources (self, peer, or supervisor).
Besides serving as a demonstration problem for the Air Force, these analyses should be of interest to other
scientists as well, since the comparative value of information derived from different forms or sources has
been an important research question for many years (e.g., Kraiger, 1985; Landy & Farr, 1980).


METHOD 


Sample 


Proficiency ratings were collected from 256 first-term U.S. Air Force enlisted personnel. All ratees were
jet  engine  mechanics  (Specialty  Code  426X2).  Ratees  had  between  6  months  and  42  months  of  job 
experience  and  worked  on  one  of  three  primary  engine  types. 


Facets  of  Generalization 


In the present study, there were three facets of generalization: rating forms, specific items on each form,
and rating sources (self, peer, and supervisor). Items were nested within forms and both were crossed with
sources.  As  illustrated  by  the  Venn  diagram  in  Figure  3,  there  are  12  distinct  sources  of  variance:  error, 
persons,  sources,  forms,  items  within  forms,  forms  by  sources,  persons  by  sources,  persons  by  forms,  items 
within  forms  by  persons,  items  within  forms  by  sources,  persons  by  forms  by  sources,  and  persons  by 
sources  by  items  within  forms.  Each  facet  is  described  in  greater  detail  below. 


Figure 3. Venn diagram illustrating variance components for generalizability design.


Rating  Forms.  Four  different  forms  were  used  to  collect  proficiency  ratings  as  part  of  a  larger  Air  Force 
research  project  on  performance  measurement.  These  forms  can  be  assumed  to  be  random  samples  of  a 
larger  universe  of  all  possible  forms  which  could  be  used  to  assess  ratee  performance.1  An  important 
research  question  is  whether  performance  evaluations  generalize  over  different  rating  forms. 

1 An alternative assumption is that these particular forms are not random samples from the same universe
but are representative of different levels within a hierarchical universe of form specificity. That is, dimensions
on some forms could be nested within dimensions on more general types of forms. However, research by
Vance, MacCallum, Coovert, and Hedge (1988) suggested that each form provides equivalent (and
redundant) information. Moreover, a hierarchical design is less desirable in the present context because it
would be more difficult to follow as a demonstration, would result in considerable loss of data because
some dimensions are not arranged hierarchically, and would create an unbalanced design.

In the present study, form content ranged from very specific to very general. The most specific was the
task rating form. This form required ratings on 26 to 32 tasks representative of the job content domain. The
precise number of tasks on a particular version of the form depended upon a ratee's engine assignment and
whether the principal work location was the shop or flightline. Tasks were identified from an earlier
comprehensive occupational survey conducted by the Air Force. Examples of task items are: "Installs J-79
engine starter adapter pads" and "Inspects J-79 engine aircraft throttle controls for freedom of movement."
Ratings were on a 5-point scale ranging from 1 (Never meets acceptable level of proficiency) to 5 (Always
exceeds acceptable level of proficiency).

A  slightly  less  specific  form  was  the  dimensional  rating  form.  This  form  required  ratings  on  six  job 
dimensions  identified  through  factor  analyses  of  occupational  survey  data  and  judgments  of  subject-matter 
experts.  (Again,  the  selection  of  actual  dimensions  on  a  particular  version  of  the  form  depended  upon  a 
ratee's engine assignment and work location.) Examples of dimensions are: Inspect Engine,
Remove/Replace Engine Components, and Completion of Forms. Ratings were made on a 5-point proficiency
scale  similar  to  the  one  used  for  the  task-level  ratings.  However,  each  scale  point  was  anchored  by  a 
behavioral  summary  description. 

More  general  was  the  global  rating  form.  This  form  also  used  a  5-point  proficiency  scale  anchored  by 
behavioral  descriptions;  the  form  consisted  of  only  two  broad  dimensions:  technical  proficiency  and 
interpersonal proficiency. These were thought to be two important factors underlying most specific
dimensions  typically  used  in  performance  appraisals  (Kavanagh  et  al.,  1986). 

The  fourth  and  most  general  form  was  the  Air  Force-wide  rating  form.  It  required  ratings  on  eight  broad 
dimensions thought to be relevant to all Air Force specialties. Examples of these dimensions are: Technical
Knowledge/Skill,  Initiative/Effort,  and  Military  Appearance.  Ratings  were  again  made  on  a  5-point  proficiency 
scale,  anchored  by  behavioral  descriptions. 

Items  within  Forms.  A  second  facet  of  interest  was  individual  items  on  each  form.  For  each  rating  form, 
individual  items  or  dimensions  can  be  considered  a  random  sample  of  a  larger  universe  of  all  possible  items 
which could comprise that form. However, the possible universe differs for each form. Thus, items nested
within forms were considered a second facet of generalization.

There was a computational problem with this facet. In the description of the rating forms, it can be noted
that the number of items or dimensions on each form ranged from 2 to 32. In standard ANOVA terms, this
means that the items facet is unbalanced because there is a different number of conditions of items under
each  condition  of  forms.  Though  there  has  been  little  empirical  work  on  the  precise  effects  of  unbalanced 
designs  on  variance  component  estimates,  it  is  known  that  unbalanced  designs  in  mixed  ANOVAs  often  yield 
biased  mean  square  estimates  (Searle,  1971),  which,  in  turn,  are  used  to  estimate  variance  components.  In 
general,  experts  in  generalizability  theory  recommend  against  the  use  of  unbalanced  designs  (Brennan  & 
Kane,  1979;  Shavelson  &  Webb,  1981). 

There  are  several  ways  of  handling  this  problem,  though  none  is  completely  desirable.  One  strategy 
would  be  to  randomly  select  only  two  items  on  each  form.  This  would  be  a  considerable  waste  of  data. 
Further, Monte Carlo work by Smith (1978) has shown that sampling errors in variance components increase
substantially when each additional facet in a design has fewer than 7 to 10 conditions. Since sources and
forms  (the  other  two  facets)  already  have  a  small  number  of  conditions,  additional  estimation  problems  could 
ensue  from  analyzing  the  items  facet  with  only  two  conditions. 

A  second  strategy  would  be  to  exclude  the  global  form  with  its  two  items,  randomly  select  six  task  items 
and  six  Air  Force-wide  dimensions,  and  run  the  generalizability  analyses  with  only  three  conditions  of  the 
form  facet.  This  method  also  adds  a  new  facet  with  a  small  number  of  conditions,  though  the  error  introduced 
should not be as severe as when a facet with only two conditions is added. Additionally, the possibility of
inaccurately  estimating  variance  due  to  forms  increases,  because  this  is  now  analyzed  with  only  three 
conditions.  As  there  is  no  one  clearly  preferable  strategy,  both  methods  were  attempted  and  the  results 
compared. 
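
The two balancing strategies can be sketched as a random draw of a fixed number of items per form. The item labels, function name, and seed below are hypothetical; only the item counts come from the form descriptions above, and the task form is given 26 items for illustration.

```python
import random


def balance_items(items_by_form: dict[str, list[str]], n_items: int,
                  seed: int = 0) -> dict[str, list[str]]:
    """Randomly keep n_items per form; skip forms with fewer than n_items."""
    rng = random.Random(seed)
    balanced = {}
    for form, items in items_by_form.items():
        if len(items) < n_items:
            continue  # e.g., the two-item global form drops out when n_items = 6
        balanced[form] = sorted(rng.sample(items, n_items))
    return balanced


# Hypothetical item labels; the counts follow the form descriptions
forms = {
    "task": [f"task_{k}" for k in range(1, 27)],         # 26 to 32 tasks; 26 here
    "dimensional": [f"dim_{k}" for k in range(1, 7)],     # 6 dimensions
    "global": ["technical", "interpersonal"],             # 2 dimensions
    "air_force_wide": [f"afw_{k}" for k in range(1, 9)],  # 8 dimensions
}

three_form_design = balance_items(forms, n_items=6)  # second strategy
four_form_design = balance_items(forms, n_items=2)   # first strategy
```

With six items per form the global form is excluded and three forms remain; with two items per form all four forms are retained.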

Rating Sources. The final facet of generalization was the source of the performance ratings. Ratings
were collected from incumbents, their peers, and their supervisors. The question of interest was whether
ratings generalize over sources. In some settings, these sources may be considered a random sample of a
larger universe of all possible observers/raters of performance (also including subordinates, second-level
supervisors, clients, etc.). However, there were no other possible rating sources of interest to the Air Force.
Thus, rating sources was considered a fixed facet. Computationally, variance due to rating sources was
estimated at the G study level as if sources were random. At the D study level, the rating source facet was
treated in two ways. First, a mean score across sources was computed and the generalizability of this score
was computed across forms and items within forms. Second, the generalizability of forms and items was
computed for each rating source. For these analyses, there were only five sources of variance: persons,
forms, items within forms, persons by forms, and persons by items within forms. Consistent with the
recommendations of Shavelson (1986), it is recognized that the latter analyses would be considered the
most relevant if the variance of the persons facet was relatively large.

One final note on rating sources is in order. Mechanics were rated by one to three coworkers selected for their
familiarity with the ratee and based on their availability. To again avoid the problem of an unbalanced design,
a single peer rating was randomly selected for each ratee and retained for analyses.


Data  Collection 


Mechanics  rated  themselves  and  were  rated  by  selected  peers  and  supervisors.  For  all  raters,  the  order 
of  forms  was  global,  dimensional,  task,  and  then  Air  Force-wide.  Ratings  were  made  during  working  hours 
after  a  rigorous  rater  training  program.  All  raters  were  informed  that  the  ratings  were  being  collected  for 
research  purposes. 


G  Study  Analyses 

All  G  and  D  study  analyses  were  performed  using  GENOVA,  a  Fortran-based  computer  program 
designed  for  generalizability  analyses  (Crick  &  Brennan,  1983).  Because  GENOVA  uses  listwise  deletion  of 
missing  data,  missing  data  for  individuals  were  replaced  with  sample  means  when  three  or  fewer  data  points 
were  missing;  if  more  than  three  data  points  were  missing,  the  entire  case  was  deleted.  As  a  result  of  the 
treatment  of  missing  data,  different  designs  employed  different  sample  sizes.  It  should  be  noted  that  both 
analyses  with  and  without  missing  data  produced  similar  results. 
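
The missing-data rule described above can be sketched as follows. This is an illustration, not GENOVA code. It assumes that "sample means" refers to the mean of the same item across all ratees with an observed rating; the function name and the small data set are hypothetical.

```python
from typing import Optional


def treat_missing(cases: list[list[Optional[float]]],
                  max_missing: int = 3) -> list[list[float]]:
    """Mean-replace up to max_missing missing ratings per case; drop cases with more."""
    n_items = len(cases[0])
    # Item means from the observed (non-missing) ratings of all cases (assumption)
    item_means = []
    for j in range(n_items):
        observed = [case[j] for case in cases if case[j] is not None]
        item_means.append(sum(observed) / len(observed))
    cleaned = []
    for case in cases:
        n_missing = sum(value is None for value in case)
        if n_missing > max_missing:
            continue  # more than three data points missing: delete the entire case
        cleaned.append([item_means[j] if value is None else value
                        for j, value in enumerate(case)])
    return cleaned


# Hypothetical ratings for three ratees on five items (None = missing)
ratings = [
    [4, 3, None, 5, 4],           # one missing value: replaced by the item mean
    [None, None, None, None, 2],  # four missing values: case deleted
    [2, 3, 4, 3, 2],
]
cleaned = treat_missing(ratings)
```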

For  the  full  design,  ratees  (mechanics)  were  treated  as  a  between-subjects  factor  whereas  sources, 
forms,  and  items  nested  within  forms  were  treated  as  repeated  measures  factors.  For  within-rater  analyses, 
forms  and  items  nested  within  forms  were  the  repeated  measures  factors. 

D  Study  Analyses 

G  study  variance  components  were  used  as  input  for  simulated  D  study  results  under  different 
measurement  conditions.  All  D  study  analyses  were  conducted  on  data  from  the  3-form,  6-item  full  G  study 
design.  These  simulated  D  study  conditions  were  selected  to  approximate  current  Air  Force  testing 
conditions  or  to  represent  various  possible  conditions  of  use.  For  example,  estimated  variance  components 
and  generalizability  coefficients  were  calculated  from  the  full  design  for  instances  in  which  a  single  rater  uses 
a single 8-item form, or two 8-item forms, or four 8-item forms; three raters (the incumbent, a peer, and a
supervisor)  each  use  a  single  8-item  form;  and  three  raters  each  use  four  8-item  forms  (see  Table  4).  The 
latter  condition  approximates  current  measurement  conditions  on  the  Air  Force  Job  Performance 
Measurement  Project.  For  within-rater  analyses,  D  study  results  were  calculated  for  one  4-item,  one  8-item, 
two  8-item,  and  four  2-item  forms  (see  Tables  6,  7,  and  8).  Of  course,  D  study  analyses  are  not  limited  to 
these  sets  of  conditions.  Other  decision-makers  could  specify  their  own  conditions  and  calculate  additional 
statistics. 

Operationally, D study analyses are performed by reducing the G study variance component for each
facet by the number of times the facet is measured under each particular set of measurement conditions.
For example, if the estimated variance component for forms is .0016, the D study estimate would be .0008
when the number of forms is two and .0004 when the number of forms is four. Intuitively, this can be
understood when it is realized that G study estimates represent average error variance for a single observation
and, as in classical test theory, these estimates are reduced by a factor of the number of times they are
measured.
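
The arithmetic in this example can be written as a small sketch. The function name is hypothetical. For an interaction effect, the divisor is assumed to be the product of the D study sample sizes of every facet in the effect.

```python
from math import prod


def d_study_component(g_component: float, n_conditions: list[int]) -> float:
    """Divide a G study variance component by the D study sample sizes.

    n_conditions lists the number of conditions for each facet in the effect,
    e.g. [2] for two forms, or [3, 4] for a raters-by-forms effect.
    """
    return g_component / prod(n_conditions)


forms_g = 0.0016                                 # G study estimate for forms
two_forms = d_study_component(forms_g, [2])      # about .0008
four_forms = d_study_component(forms_g, [4])     # about .0004
```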

In addition to the variance component estimates for each D study, total relative error ($\sigma^2_\delta$) and absolute
error ($\sigma^2_\Delta$) were calculated, along with a corresponding generalizability coefficient. The generalizability
coefficient represents the proportion of universe score variance to total variance. For relative decisions (e.g.,
for validation purposes), total variance is equal to universe score variance plus total relative error. This
coefficient represents the generalizability of ratings over conditions of measurement when all facets are
random. If a facet were to be considered fixed, a new generalizability coefficient could be calculated after
first adding to the universe score variance all D study variance components involving the facet.
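
In symbols, the coefficient for relative decisions can be written as below. Here $\sigma^2_\tau$ denotes universe score variance and $\sigma^2_\delta$ denotes relative error variance. The expansion of $\sigma^2_\delta$ is a sketch that assumes the persons by raters by items-within-forms design used here, with all facets random.

```latex
E\rho^2 = \frac{\sigma^2_\tau}{\sigma^2_\tau + \sigma^2_\delta},
\qquad
\sigma^2_\tau = \sigma^2_p,
\qquad
\sigma^2_\delta = \sigma^2_{pR} + \sigma^2_{pF} + \sigma^2_{pRF}
                + \sigma^2_{p(I:F)} + \sigma^2_{pR(I:F),e}
```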


RESULTS 

Generalizability results will be presented as follows. First will be G study analyses for the full design
(raters  by  items  within  forms)  and  within-rater  designs  (items  nested  within  forms).  For  the  full  design, 
separate  results  are  presented  for  analyses  with  six  items  within  three  forms,  and  two  items  within  four  forms. 
Only  three-form  analyses  are  presented  for  the  within-rater  analyses.  Next,  simulated  D  study  results  are 
presented  for  the  full  design  and  within-rater  designs. 


G  Study:  Full  Design 

G study estimates of variance components with 90% confidence intervals are presented in Tables 1 and
2. The confidence intervals indicate the precision in estimation of the population values of variance
components, given the sample size and design complexity. The confidence intervals are based on the ratio
of the estimated variance component to its standard error and were calculated from procedures detailed by
Satterthwaite (1941, 1946). Satterthwaite's method corrects the upper limit of the interval, which is frequently
too low when calculated as the product of the normal deviate and the standard error. Also included in the
tables are degrees of freedom and mean squares associated with each effect. It should be noted that the
variance components are very similar across designs. One exception is variance due to forms. Not
surprisingly, by increasing the number of forms in the G study design from three to four, variance due to
forms is decreased (from .033 to .001). The confidence intervals in Table 1 are generally narrower than those
of Table 2, reflecting smaller standard errors and greater degrees of freedom in the three-form, six-item
analysis.

Inspection of the variance components in either table reveals several interesting findings. First, the most
variance is attributed to the residual term, $\sigma^2_{pr(i:f),e}$ (.290 and .293 in the six items within three forms and the
two items within four forms designs, respectively). This represents undifferentiated error. The effect for
persons, variance due to individual differences, was fairly large in both designs ($\sigma^2_p$ = .100, .151). This
variance is desirable and represents universe score variance. Among the other error sources of variance in
the design, the larger effects involved the interaction of rater sources and ratees. $\sigma^2_{pr}$ (.278, .186) was large,
indicating at least two sources differentially ranked ratees. $\sigma^2_{prf}$ (.053, .016) was also nontrivial, indicating
that sources differentially ranked ratees, but did so differently on different forms.

It should be recalled that rating sources can be considered a fixed facet for subsequent D study analyses.
After Shavelson (1986), a fixed facet can be treated by averaging scores across levels of the facet or analyzing
other facets within levels of the facet. The generalizability of scores over levels of rater sources was
investigated and these results are presented in Table 3.


Table 1. Estimated Variance Components for
G Study with Six Items and Three Forms

                                                  Variance     90% confidence interval
Effect                       DF         MS        component    Lower      Upper

Persons (p)                  222      11.550      .100         .072       .148
Raters (r)                     2      89.420      .020         .011       .060
Forms (f)                      2     152.005      .033         .017       .098
Items w/in Forms (i:f)        15      15.846      .022         .013       .046
pr                           442       5.602      .278         .246       .316
pf                           442       1.172      .022         .016       .033
rf                             4       2.900      .001         .001       .003
prf                          884        .606      .053         .045       .062
p(i:f)                     3,315        .460      .057         .051       .064
r(i:f)                        30        .892      .003         .002       [illegible]
pr(i:f)                    6,630        .290      .290         .282       .298
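
The link between the mean squares and the variance components in Table 1 can be illustrated with the standard expected-mean-square equations for a random-effects persons by raters by items-within-forms design. The sketch below is not GENOVA's algorithm, which is not described here. It covers only the effects involving persons, and the function name is hypothetical.

```python
def person_variance_components(ms: dict[str, float],
                               n_r: int, n_f: int, n_i: int) -> dict[str, float]:
    """Solve expected-mean-square equations for effects that involve persons.

    Design: persons x raters x (items nested within forms), all effects random.
    n_r = raters, n_f = forms, n_i = items per form in the G study.
    """
    e = ms["pr(i:f)"]  # residual mean square estimates the residual component
    return {
        "pr(i:f)": e,
        "p(i:f)": (ms["p(i:f)"] - e) / n_r,
        "prf": (ms["prf"] - e) / n_i,
        "pf": (ms["pf"] - ms["prf"] - ms["p(i:f)"] + e) / (n_r * n_i),
        "pr": (ms["pr"] - ms["prf"]) / (n_f * n_i),
        "p": (ms["p"] - ms["pr"] - ms["pf"] + ms["prf"]) / (n_r * n_f * n_i),
    }


# Mean squares from Table 1 (three raters, three forms, six items per form)
table1_ms = {"p": 11.550, "pr": 5.602, "pf": 1.172, "prf": 0.606,
             "p(i:f)": 0.460, "pr(i:f)": 0.290}
estimates = person_variance_components(table1_ms, n_r=3, n_f=3, n_i=6)
for effect, value in estimates.items():
    print(f"{effect}: {value:.3f}")
```

Rounded to three decimals, these estimates agree with the variance components reported in Table 1 for the same effects.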

Table 2. Estimated Variance Components for
G Study with Two Items and Four Forms

                                                  Variance     90% confidence interval
Effect                       DF         MS        component    Lower      Upper

Persons (p)                  206       5.591      .151         .119       .199
Raters (r)                     2      27.841      .015         .008       .044
Forms (f)                      3      12.147      .001         .001       .003
Items w/in Forms (i:f)         4      10.639      .015         .008       .044
pr                           412       1.812      .186         .163       .214
pf                           618        .477      .000         .000       .000
rf                             6       1.434      .001         .000       .002
prf                        1,236        .326      .016         .009       .048
p(i:f)                       824        .464      .057         .046       .074
r(i:f)                         8       1.086      .004         .002       .011
pr(i:f)                    1,648        .293      .293         .276       .311

G  Study:  Within-Rater  Design 

The  relatively  large  effect  for  the  persons-by-source  interaction  indicates  that  ratees  are  differentially 
ranked by sources. This finding suggests that it may be inappropriate to average scores over sources and
that  separate  analyses  for  other  facets  should  be  conducted  within  rater  sources.  Estimated  variance 
components  of  an  items  within  forms  analysis  for  each  rating  source  are  presented  in  Table  3.  Results 
presented  are  based  on  six  items  within  three  forms;  similar  results  (not  presented)  were  obtained  from 
analyses  on  two  items  within  four  forms.  The  table  also  contains  degrees  of  freedom,  mean  squares,  and 
90%  confidence  intervals  associated  with  each  effect. 

Table 3. G Study Variance Components Within Rater Sources

                                                  Variance     90% confidence interval
Effect                       DF         MS        component    Lower      Upper

Self:
  Persons (p)                217       4.127      .193         .161       .235
  Forms (f)                    2      51.736      .034         .018       .102
  Items w/in forms (i:f)      15       5.865      .025         .015       .052
  pf                         434        .666      .053         .042       .068
  pi:f                     3,255        .351      .351         .337       .368

Supervisor:
  Persons (p)                217       7.677      .375         .317       .453
  Forms (f)                    2      41.104      .026         .014       .077
  Items w/in forms (i:f)      15       6.064      .026         .016       .054
  pf                         434        .926      .097         .082       .117
  pi:f                     3,255        .346      .346         .333       .360

Peer:
  Persons (p)                217       5.572      .265         .222       .321
  Forms (f)                    2      67.000      .047         .024       .137
  Items w/in forms (i:f)      15       5.481      .024         .014       .049
  pf                         434        .809      .077         .064       .094
  pi:f                     3,255        .350      .350         .336       .364

Across rating sources, the largest variance is generally attributed to undifferentiated error, $\sigma^2_{pi:f,e}$. The
size of this estimated variance component was consistent across sources. Other sources of error variance
were small relative to effects for undifferentiated error and individual differences. The forms and
persons-by-forms effects are both somewhat larger for supervisory and peer ratings than for self ratings.

The relative size of the variance component for persons is much higher in these analyses than in the full
design. This is expected since the universe of admissible observations is smaller. In other words, scores of
persons are more generalizable within a smaller domain. Examining the relative size of variance components
across sources, it can be seen that universe variance is smallest for self ratings ($\sigma^2_p$ = .193) and largest for
supervisory ratings ($\sigma^2_p$ = .375). Overall, the relative proportion of design variance attributed to $\sigma^2_p$ is larger
in the within-rater design than in the full design.
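
The comparison of relative proportions can be checked directly from the estimates in Tables 1 and 3. The sketch below simply divides the persons component by the sum of all estimated components in each design. The function and variable names are hypothetical.

```python
def share_of_total(components: dict[str, float], effect: str = "p") -> float:
    """Proportion of the summed variance components attributable to one effect."""
    return components[effect] / sum(components.values())


# Full design, three forms and six items (Table 1)
full_design = {"p": .100, "r": .020, "f": .033, "i:f": .022, "pr": .278, "pf": .022,
               "rf": .001, "prf": .053, "p(i:f)": .057, "r(i:f)": .003, "pr(i:f)": .290}

# Within-rater designs (Table 3)
within_rater = {
    "self":       {"p": .193, "f": .034, "i:f": .025, "pf": .053, "pi:f": .351},
    "supervisor": {"p": .375, "f": .026, "i:f": .026, "pf": .097, "pi:f": .346},
    "peer":       {"p": .265, "f": .047, "i:f": .024, "pf": .077, "pi:f": .350},
}

shares = {"full": share_of_total(full_design)}
shares.update({source: share_of_total(vc) for source, vc in within_rater.items()})
for design, share in shares.items():
    print(f"{design}: {share:.3f}")
```

Persons account for roughly 11% of the summed components in the full design, compared with about 29% for self ratings, 35% for peer ratings, and 43% for supervisory ratings.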


D Studies: Full Design

Results of simulated D study analyses of the full design are presented in Tables 4 and 5. For the analyses
presented  in  Table  4,  rater  source  was  assumed  to  be  a  random  facet  and  five  different  specifications  of 
measurement  conditions  were  made:  one  rater,  one  8-item  form;  one  rater,  two  8-item  forms;  one  rater,  four 
8-item  forms;  three  raters,  one  8-item  form;  three  raters,  four  8-item  forms.  For  analyses  presented  in  Table 
5,  the  rater  source  facet  was  considered  fixed  (with  three  conditions)  and  four  measurement  specifications 
were  made:  one  8-item  form,  two  4-item  forms,  four  8-item  forms,  and  one  12-item  form.  Assuming  that 
sources  are  fixed  implies  that  the  specific  rater  types  sampled  at  the  G  study  level  exhaust  the  universe  of 
possible  rater  sources  and  that  individuals’  scores  are  standardized  (summed  or  averaged)  over  rating 
sources. 




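Table 4. Simulated D Study Results of Full Design,
Illustrating Changes in Raters or Forms^a

G study estimates,                        D study estimates, p x R x (I:F) design
p x r x (i:f) design                      n_r =      1       1       1       3       3
                                          n_f =      1       2       4       1       4
                                          n_i =      8       8       8       8       8

$\sigma^2_p$            .100     $\sigma^2_p$             .100    .100    .100    .100    .100
$\sigma^2_r$            .020     $\sigma^2_R$             .020    .020    .020    .007    .007
$\sigma^2_f$            .033     $\sigma^2_F$             .033    .017    .008    .033    .008
$\sigma^2_{i:f}$        .022     $\sigma^2_{I:F}$         .003    .001    .001    .003    .001
$\sigma^2_{pr}$         .278     $\sigma^2_{pR}$          .278    .278    .278    .093    .093
$\sigma^2_{pf}$         .022     $\sigma^2_{pF}$          .022    .011    .006    .022    .006
$\sigma^2_{rf}$         .001     $\sigma^2_{RF}$          .001    .001    .000    .000    .000
$\sigma^2_{prf}$        .053     $\sigma^2_{pRF}$         .053    .026    .013    .018    .004
$\sigma^2_{p(i:f)}$     .057     $\sigma^2_{p(I:F)}$      .007    .004    .002    .007    .002
$\sigma^2_{r(i:f)}$     .003     $\sigma^2_{R(I:F)}$      .000    .000    .000    .000    .000
$\sigma^2_{pr(i:f),e}$  .290     $\sigma^2_{pR(I:F),e}$   .036    .018    .009    .012    .003

Universe score variance          $\sigma^2_\tau$          .100    .100    .100    .100    .100
Relative error variance          $\sigma^2_\delta$        .395    .337    .307    .151    .107
Absolute error variance          $\sigma^2_\Delta$        .454    .376    .337    .195    .123
Observed score variance          $E\sigma^2_X$            .495    .436    .407    .251    .207
Generalizability coefficient     $E\rho^2$                .201    .229    .245    .397    .482
Generalizability coefficient     $\Phi$                   .180    .210    .228    .338    .447

^a Based on G study results for 3-form, 6-item analysis.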


The notation used to present the D study results is drawn from Brennan (1983; Brennan & Kane, 1979)
and requires some explanation. Whereas lowercase letters are used to present G study variance
components (e.g., $\sigma^2_r$ and $\sigma^2_{pr}$), uppercase letters represent the corresponding D study estimates (e.g., $\sigma^2_R$
and $\sigma^2_{pR}$). Thus, a capital letter for a facet (recall that persons are not a facet) signifies that the variance
component has been averaged over the levels of the effect indicated by the D study measurement conditions.
For example, with three raters and four forms, the estimated variance component $\sigma^2_{prf}$ is averaged over 12
cells and $\sigma^2_{pRF} = \sigma^2_{prf}/12$.
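
The averaging rule described above can be sketched in a few lines of Python. This is an illustration only, not the software used for the reported analyses: the function name `d_study_components` and the dictionary layout are assumptions introduced here, and the inputs are the rounded G study estimates from Table 4, so results may differ from the tabled D study values in the third decimal place.

```python
# G study variance component estimates for the p x r x (i:f) design,
# copied from the G study column of Table 4 (rounded to three decimals).
G_COMPONENTS = {
    "p": 0.100, "r": 0.020, "f": 0.033, "i:f": 0.022,
    "pr": 0.278, "pf": 0.022, "rf": 0.001, "prf": 0.053,
    "p(i:f)": 0.057, "r(i:f)": 0.003, "pr(i:f)": 0.290,
}


def d_study_components(g_components, n_r, n_f, n_i):
    """Divide each G study component by the number of conditions averaged over.

    Persons (p) are the object of measurement, so a component is divided only
    by the sample sizes of the facets (r, f, i) that appear in its label.
    Keys keep their lowercase G study labels (hypothetical naming choice).
    """
    sizes = {"r": n_r, "f": n_f, "i": n_i}
    d_components = {}
    for effect, value in g_components.items():
        divisor = 1
        for facet, n in sizes.items():
            if facet in effect:
                divisor *= n
        d_components[effect] = value / divisor
    return d_components


if __name__ == "__main__":
    # Three raters, four 8-item forms (last column of Table 4).
    result = d_study_components(G_COMPONENTS, n_r=3, n_f=4, n_i=8)
    for effect, value in result.items():
        print(f"{effect:>8}: {value:.3f}")
```

Because items are nested within forms, any effect containing the item index is divided by the product of the number of forms and the number of items per form, which is why the residual shrinks fastest as conditions are added.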

The lower portions of Tables 4 and 5 present estimates of universe score variance ($\sigma^2_\tau$), relative error
variance ($\sigma^2_\delta$), absolute error variance ($\sigma^2_\Delta$), total observed score variance ($\sigma^2_X$), and two generalizability
coefficients ($E\rho^2$ and $\Phi$). Exact computational formulas and theoretical explanations for these values are
given in Brennan (1983). Briefly, universe score variance equals $\sigma^2_p$. In a random model, the relative error
variance is equal to the sum of all effects which contain $p$ and at least one other index (e.g., $\sigma^2_{pF}$), and the
absolute error variance is equal to the sum of all effects in the design except $\sigma^2_p$. Total observed score
variance is equal to universe score variance plus relative error variance. The generalizability coefficient $E\rho^2$
is equal to the ratio of $\sigma^2_\tau$ to the sum of universe score variance and relative error variance, and $\Phi$ is equal
to the ratio of $\sigma^2_\tau$ to the sum of universe score variance and absolute error variance.
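
As a check on these definitions, the following minimal Python sketch computes the summary values for a random model from a set of D study components. The function name `summarize` and the uppercase dictionary keys are assumptions made for illustration; the inputs are the rounded D study values for one rater using one 8-item form (first D study column of Table 4), so the results agree with the tabled values only to within rounding.

```python
def summarize(d_components):
    """Return universe score variance, error variances, and both coefficients
    for a fully random design with persons (p) as the object of measurement."""
    universe = d_components["p"]
    # Relative error: effects that contain p and at least one other index.
    relative = sum(v for effect, v in d_components.items()
                   if "p" in effect and effect != "p")
    # Absolute error: every effect in the design except persons.
    absolute = sum(v for effect, v in d_components.items() if effect != "p")
    observed = universe + relative
    return {
        "universe": universe,
        "relative_error": relative,
        "absolute_error": absolute,
        "observed": observed,
        "E_rho2": universe / observed,
        "Phi": universe / (universe + absolute),
    }


# D study components for one rater and one 8-item form (Table 4, rounded).
ONE_RATER_ONE_FORM = {
    "p": 0.100, "R": 0.020, "F": 0.033, "I:F": 0.003,
    "pR": 0.278, "pF": 0.022, "RF": 0.001, "pRF": 0.053,
    "p(I:F)": 0.007, "R(I:F)": 0.000, "pR(I:F)": 0.036,
}

if __name__ == "__main__":
    for name, value in summarize(ONE_RATER_ONE_FORM).items():
        print(f"{name:>15}: {value:.3f}")
```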

Looking first at Table 4, it can be seen that in comparison to the G study results, measurement error
is reduced considerably by replications of raters, forms, or items. For example, undifferentiated error
variance ($\sigma^2_{pR(I:F)}$) drops from .290 to .036 by simply averaging over eight items on a single form with a single
rater. This set of conditions also decreases variance in all other effects which contain the item facet ($\sigma^2_{I:F}$
and $\sigma^2_{p(I:F)}$). Measurement specifications which contain more than one rater or form condition result in
reductions in error variance for effects containing either the rater or form facet. For example, variance due
to forms decreases from .033 when one form is used to .008 when four forms are used. When scores are
averaged over three sources, four forms, and eight items, there is virtually no error variance due to items
within forms, the source-by-form interaction, the person-by-items-within-forms interaction, or the
sources-by-items-within-forms interaction.


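Table 5. Simulated D Study Results of Full Design,
Illustrating Sources Fixed, Illustrating Changes in Items^a

$\sigma^2$ for $pr(i:f)$ design  $\sigma^2$ for $pR(I:F)$ design
                                 $n_f$                   1      2      4      1
                                 $n_i$                   8      4      8      12

$\sigma^2_p$           .100      $\sigma^2_p$            .100   .100   .100   .100
$\sigma^2_r$           .020      $\sigma^2_R$            ^b
$\sigma^2_f$           .033      $\sigma^2_F$            .033   .017   .008   .033
$\sigma^2_{i:f}$       .022      $\sigma^2_{I:F}$        .003   .003   .001   .002
$\sigma^2_{pr}$        .278      $\sigma^2_{pR}$         ^b
$\sigma^2_{pf}$        .022      $\sigma^2_{pF}$         .022   .011   .006   .022
$\sigma^2_{rf}$        .001      $\sigma^2_{RF}$         ^b
$\sigma^2_{prf}$       .053      $\sigma^2_{pRF}$        ^b
$\sigma^2_{p(i:f)}$    .057      $\sigma^2_{p(I:F)}$     .007   .007   .002   .005
$\sigma^2_{r(i:f)}$    .003      $\sigma^2_{R(I:F)}$     ^b
$\sigma^2_{pr(i:f)}$   .290      $\sigma^2_{pR(I:F)}$    ^b

                                 $\sigma^2_\tau$         .100   .100   .100   .100
                                 $\sigma^2_\delta$       .029   .018   .007   .027
                                 $\sigma^2_\Delta$       .065   .038   .016   .062
                                 $\sigma^2_X$            .129   .118   .107   .126
                                 $E\rho^2$               .774   .846   .932   .789
                                 $\Phi$                  .604   .726   .859   .616

^a Based on G study results for 3-form, 6-item analysis.
^b No variance since scores are averaged over sources.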


The lower half of Table 4 presents universe score variances, two error variances, total observed score variances,
and generalizability coefficients for each set of measurement conditions. The universe score variance ($\sigma^2_\tau$)
remains the same regardless of conditions. It can be seen that when rater sources are considered random,
the generalizability of ratings is fairly low regardless of the measurement conditions. $E\rho^2$ (the generalizability
coefficient for relative decisions) ranges from .201 when one rater uses one 8-item form to .482 with scores averaged
over three raters using four 8-item forms. $\Phi$, the generalizability coefficient for absolute decisions, is
somewhat lower. Since the estimated variance component for the persons-by-rater interaction is large, it
appears that all three rating sources are necessary conditions to maximize generalizability.

A more positive conclusion emerges from inspection of results in Table 5, based on the assumption of
fixed rater sources. In these analyses, ratings are averaged over the three sources, and the generalizability
of these means over the facets of interest is investigated. Because scores are averaged over sources, there
can be no variance due to different sources, and variance components for effects which contain sources are
set to zero for the D study analyses.

When rater sources are assumed to be fixed, the mean rating over sources is very generalizable under
a variety of conditions. $E\rho^2$, the generalizability coefficient for relative decisions, ranges from .774 with one
8-item form to .932 with four 8-item forms. Corresponding values for $\Phi$ are somewhat lower. Generalizability
coefficients above .70 are generally considered adequate for decision-making purposes. It is interesting to
note that while adding more items is considerably easier than adding more rating forms, it has less of an
effect on the generalizability of ratings. For example, adding four additional items to the single 8-item form
will only increase the generalizability coefficient from .774 to .789, but changing from a single 8-item form to
two 4-item forms instead will increase the coefficient to .846 (see Table 5).
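
The comparison between adding items and adding forms can be reproduced with a short Python sketch. It follows the simplification described in the text, in which every component containing sources is set to zero, so only the persons, persons-by-forms, and persons-by-items-within-forms components enter the relative coefficient. The function name `e_rho2_fixed_sources` is an assumption made for illustration, and the inputs are the rounded G study values from Table 5.

```python
# G study components that remain when sources are treated as fixed
# (rounded values from the G study column of Table 5).
G_FIXED = {"p": 0.100, "pf": 0.022, "p(i:f)": 0.057}


def e_rho2_fixed_sources(n_f, n_i, g=G_FIXED):
    """Relative generalizability coefficient for n_f forms of n_i items each."""
    # The persons-by-forms term is divided by the number of forms only;
    # the persons-by-items term is divided by the total number of items.
    relative_error = g["pf"] / n_f + g["p(i:f)"] / (n_f * n_i)
    return g["p"] / (g["p"] + relative_error)


DESIGNS = {
    "one 8-item form": (1, 8),
    "one 12-item form": (1, 12),
    "two 4-item forms": (2, 4),
    "four 8-item forms": (4, 8),
}

if __name__ == "__main__":
    for label, (n_f, n_i) in DESIGNS.items():
        print(f"{label:>18}: {e_rho2_fixed_sources(n_f, n_i):.3f}")
```

The sketch makes the reason for the pattern visible: the persons-by-forms component is not reduced by adding items to a form, so lengthening a single form leaves that error term untouched, whereas splitting the same eight items across two forms halves it.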


D  Studies:  Within-Raters 


Results of similar analyses for self, supervisory, and peer ratings are presented in Tables 6, 7, and 8.
Measurement conditions were specified as one 4-item form, one 8-item form, two 8-item forms, and four
2-item forms. Results are presented for only the 3-form, 6-item G study analysis, though results were
comparable for both designs. Comparing the results for the three rating sources, it appears that ratings from
supervisors are more dependable over forms and items than are ratings from incumbents or peers. Ratings
by peers are also more generalizable than ratings by incumbents (self ratings). However, regardless of
source, ratings appear to be fairly generalizable within source when at least two 8-item forms are used. The
generalizability coefficients under these conditions for self, supervisory, and peer ratings are .800, .843, and
.815, respectively. Coefficients for ratings over four 2-item forms are not appreciably different from those for
ratings over two 8-item forms.
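
The within-source comparison can be sketched in Python as follows. The function name `coefficients` and the dictionary layout are assumptions made for illustration; the inputs are the rounded G study estimates reported in Tables 6, 7, and 8 for the p x (i:f) design, so the computed coefficients match the tabled values only to within rounding.

```python
# G study variance components for each rating source, p x (i:f) design
# (rounded values from the G study columns of Tables 6, 7, and 8).
G_BY_SOURCE = {
    "self":       {"p": 0.192, "f": 0.035, "i:f": 0.025, "pf": 0.053, "p(i:f)": 0.351},
    "supervisor": {"p": 0.375, "f": 0.026, "i:f": 0.026, "pf": 0.097, "p(i:f)": 0.346},
    "peer":       {"p": 0.265, "f": 0.047, "i:f": 0.024, "pf": 0.077, "p(i:f)": 0.350},
}


def coefficients(g, n_f, n_i):
    """Return (E_rho2, Phi) for n_f forms of n_i items within one rating source."""
    relative = g["pf"] / n_f + g["p(i:f)"] / (n_f * n_i)
    # Absolute error adds the form and item main effects to the relative error.
    absolute = relative + g["f"] / n_f + g["i:f"] / (n_f * n_i)
    e_rho2 = g["p"] / (g["p"] + relative)
    phi = g["p"] / (g["p"] + absolute)
    return e_rho2, phi


if __name__ == "__main__":
    # Two 8-item forms, the condition highlighted in the text.
    for source, g in G_BY_SOURCE.items():
        e_rho2, phi = coefficients(g, n_f=2, n_i=8)
        print(f"{source:>10}: E_rho2 = {e_rho2:.3f}, Phi = {phi:.3f}")
```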

Inspection  of  the  D  study  variance  components  for  all  sources  reveals  that  the  persons-by-forms  effect 
is  the  largest  source  of  error  variance.  While  this  effect  is  reduced  by  increasing  the  number  of  forms,  it 
remains relatively large in comparison to other sources. This effect represents variance due to ratees being
differentially  ranked  on  different  forms.  Attention  to  this  problem,  perhaps  during  rater  training,  could  result 
in  more  generalizable  ratings. 


Table 6. Simulated D Study Results of Self Ratings^a

$\sigma^2$ for $p(i:f)$ design   $\sigma^2$ for $p(I:F)$ design
                                 $n_f$                   1      1      2      4
                                 $n_i$                   4      8      8      2

$\sigma^2_p$           .192      $\sigma^2_p$            .192   .192   .192   .192
$\sigma^2_f$           .035      $\sigma^2_F$            .035   .035   .017   .009
$\sigma^2_{i:f}$       .025      $\sigma^2_{I:F}$        .006   .003   .002   .003
$\sigma^2_{pf}$        .053      $\sigma^2_{pF}$         .053   .053   .026   .014
$\sigma^2_{p(i:f)}$    .351      $\sigma^2_{p(I:F)}$     .088   .044   .022   .044

                                 $\sigma^2_\tau$         .192   .192   .192   .192
                                 $\sigma^2_\delta$       .140   .096   .048   .057
                                 $\sigma^2_\Delta$       .181   .134   .067   .069
                                 $\sigma^2_X$            .332   .289   .240   .248
                                 $E\rho^2$               .578   .666   .800   .774
                                 $\Phi$                  .515   .589   .741   .738

^a Based on G study results for 3-form, 6-item analysis.


Table 7. Simulated D Study Results of Supervisory Ratings^a

$\sigma^2$ for $p(i:f)$ design   $\sigma^2$ for $p(I:F)$ design
                                 $n_f$                   1      1      2      4
                                 $n_i$                   4      8      8      2

$\sigma^2_p$           .375      $\sigma^2_p$            .375   .375   .375   .375
$\sigma^2_f$           .026      $\sigma^2_F$            .026   .026   .013   .007
$\sigma^2_{i:f}$       .026      $\sigma^2_{I:F}$        .007   .003   .002   .003
$\sigma^2_{pf}$        .097      $\sigma^2_{pF}$         .097   .097   .048   .024
$\sigma^2_{p(i:f)}$    .346      $\sigma^2_{p(I:F)}$     .086   .043   .022   .043

                                 $\sigma^2_\tau$         .375   .375   .375   .375
                                 $\sigma^2_\delta$       .183   .140   .070   .068
                                 $\sigma^2_\Delta$       .216   .170   .085   .077
                                 $\sigma^2_X$            .558   .515   .445   .443
                                 $E\rho^2$               .671   .728   .843   .847
                                 $\Phi$                  .634   .689   .816   .829

^a Based on G study results for 3-form, 6-item analysis.

Table 8. Simulated D Study Results of Peer Ratings^a

$\sigma^2$ for $p(i:f)$ design   $\sigma^2$ for $p(I:F)$ design
                                 $n_f$                   1      1      2      4
                                 $n_i$                   4      8      8      2

$\sigma^2_p$           .265      $\sigma^2_p$            .265   .265   .265   .265
$\sigma^2_f$           .047      $\sigma^2_F$            .047   .047   .023   .012
$\sigma^2_{i:f}$       .024      $\sigma^2_{I:F}$        .006   .003   .001   .003
$\sigma^2_{pf}$        .077      $\sigma^2_{pF}$         .077   .077   .038   .019
$\sigma^2_{p(i:f)}$    .350      $\sigma^2_{p(I:F)}$     .087   .044   .022   .044

                                 $\sigma^2_\tau$         .265   .265   .265   .265
                                 $\sigma^2_\delta$       .164   .120   .060   .063
                                 $\sigma^2_\Delta$       .217   .170   .085   .077
                                 $\sigma^2_X$            .429   .385   .325   .328
                                 $E\rho^2$               .617   .688   .815   .808
                                 $\Phi$                  .550   .609   .757   .774

^a Based on G study results for 3-form, 6-item analysis.


DISCUSSION


The  format  of  this  discussion  section  will  be  as  follows.  First,  the  results  will  be  interpreted  within  the 
context  of  both  generalizability  theory  and  performance  appraisal.  This  will  be  followed  by  an  assessment 
of  the  relevancy  of  generalizability  theory  to  the  project  and  some  final  conclusions  and  recommendations. 


Interpretation  of  Results 

The  results  indicate  that  performance  ratings  collected  as  part  of  the  performance  measurement  project 
are somewhat generalizable across several different universes of generalization. As would be expected,
ratings are more generalizable within smaller universes of admissible observations (e.g., rating sources).
For  the  full  design,  ratings  will  have  an  acceptable  level  of  generalizability  only  when  the  three  rating  sources 
are  assumed  fixed  and  at  least  two  forms  are  used.  Under  these  conditions,  the  generalizability  coefficient 
for  ratings  exceeds  .80,  which  is  the  minimum  acceptable  level  proposed  by  Cardinet,  Tourneur,  and  Allal 
(1976).  There  are  several  ways  of  interpreting  this  generalizability  coefficient.  Literally,  it  is  the  ratio  of 
universe  score  variance  to  observed  score  variance  and  is  analogous  to  the  reliability  coefficient  in  classical 
test  theory.  Alternatively,  it  may  be  understood  as  an  intraclass  correlation  representing  the  average 
correlation  of  observed  deviation  scores  (from  their  overall  mean)  and  universe  deviation  scores  (from  their 
overall  mean).  Perhaps  the  most  straightforward  interpretation  is  that  the  generalizability  coefficient 
represents  the  proportion  of  observed  score  variance  which  can  be  attributed  to  the  object  being  measured. 
Thus,  when  ratings  are  averaged  over  three  rater  sources  each  using  four  8-item  forms,  nearly  90%  of  the 
observed  score  variance  is  due  to  individual  differences,  and  only  10%  is  due  to  measurement  error.  Since 
the  forms  and  items  within  forms  facets  were  considered  random  for  this  analysis,  other  rating  forms  or  items 
could  be  sampled  from  the  same  universe  of  admissible  observations  with  no  change  in  the  generalizability 
coefficient. 

For D study analyses within rater source, somewhat smaller sets of measurement conditions are needed
to achieve acceptable generalizability coefficients. Only two 8-item forms are needed for supervisory and
peer ratings to produce generalizability coefficients greater than .80. As the number of items or forms is
increased further, generalizability coefficients for supervisory and peer ratings may exceed .90. It should be
noted that while these coefficients are higher than for the full design, the universe of generalization is smaller,
as it is limited to only a single rater source.

For  all  analyses,  there  are  several  notable  results  involving  individual  variance  components.  The  largest 
variance  component  in  each  G  study  was  the  residual  term.  This  term  represents  the  confounding  of  the 
most  complex  interaction  term  and  undifferentiated  error.  Presumably,  the  size  of  this  term  could  be 
decreased  by  identifying  other  potential  sources  of  measurement  error  (e.g.,  measurement  occasions)  and 
including  them  in  the  design.  Although  this  strategy  does  not  reduce  total  error  variance  (it  only  reapportions 
the  error  variance),  it  does  serve  to  explicitly  identify  each  source.  Once  the  degree  of  error  due  to  different 
facets  is  known,  it  can  be  controlled  in  future  testing  administrations  through  more  careful  measurement 
procedures  or  by  averaging  scores  across  conditions  of  the  facet. 

It  should  also  be  noted  that  there  is  very  little  error  variance  due  to  sampling  of  forms.  This  finding  is 
very  similar  to  results  of  covariance  structural  modeling  analyses  by  Vance  et  al.  (1988),  who  reported  that 
the  traits  measured  by  different  forms  are  nearly  perfectly  correlated  when  measurement  error  is  controlled 
for.  The  present  data  are  more  persuasive  though,  since  Vance  et  al.  investigated  only  relationships  of  means 
over  all  form  items.  The  present  analyses  included  items  as  a  separate  facet,  yet  still  found  no  effect  for 
forms  alone.  These  results  also  substantiate  the  conclusions  of  Jacobs,  Kafry,  and  Zedeck  (1980),  who 
reported  that  different  rating  forms  had  little  effect  on  measurement  quality. 

In general, the present results also compare favorably to those of other studies of generalizability theory
and performance appraisal. In a generalizability study of faculty ratings of dental students, Littlefield et al.
(1977) reported generalizability coefficients in the .70s for two raters and five items. Two separate studies
of student evaluations of teachers reported by Kane et al. (1976) yielded generalizability coefficients between
.70 and .82 for 10 items and 10 raters. In another study of course evaluations by Gillmore et al. (1978), the
generalizability of scores over 5 raters and 10 items was only .59. Thus, the present results are at least as
high as many generalizability coefficients typically reported in the literature.


Logical  Requirements  for  Performance  Ratings 

Dickinson  (1986)  demonstrated  how  random  effects  in  an  analysis  of  variance  design  can  be  interpreted 
in  terms  of  logical  requirements  for  performance  appraisal.  This  perspective  can  be  applied  to  the  present 
results  as  well.  One  primary  extension  of  generalizability  theory  beyond  Dickinson’s  analyses  is  that  variance 
components  can  be  assessed  at  the  D  study  level  as  well  as  for  G  studies  (Dickinson’s  level  of  analysis). 

After Lawler (1967), Kavanagh, MacKinney, and Wolins (1971), and others, Dickinson (1986) argued that
performance ratings should possess validity as do other measures of individual differences. Specifically,
ratings  should  be  shown  to  have  high  convergent  validity  (among  methods  or  sources),  moderately  high 
discriminant  validity  (across  rating  dimensions),  and  low  method  bias  (i.e.,  method  of  rating  affects  ratee 
ordering).  Dickinson  then  demonstrated  how  indices  of  these  logical  requirements  can  be  calculated  through 
analysis  of  variance  applied  to  multiple  measures  of  ratees. 

Convergent validity is indicated by the variance component for those effects which interact with persons
(e.g., $\sigma^2_{pf}$ or $\sigma^2_{pr}$). These variance components indicate the degree to which ratees are ordered invariantly
over different forms or rating sources. The $\sigma^2_{pf}$ is relatively small, indicating convergent validity over rating
forms. In contrast, $\sigma^2_{pr}$ is relatively high, indicating a lack of convergence between rating sources.
Generalizability coefficients within sources are satisfactory, suggesting that ratings are dependable (or
reliable) within sources, but not across sources.

Discriminant validity is indicated by the variance component for the interaction of persons and items
within forms ($\sigma^2_{p(i:f)}$). Effective raters recognize individual strengths and weaknesses in ratees and discriminate
among these attributes when making ratings. In the present analyses, $\sigma^2_{p(i:f)}$ was moderately high for both
the 6-item, 3-form and the 2-item, 4-form G study analyses, suggesting reasonable evidence of discriminant
validity. A similar analysis could not be made within rater sources, as the variance component $\sigma^2_{p(i:f)}$ contains
both the person-by-items-within-forms effect and undifferentiated error.


Assessment  of  Generalizability  Theory 

The  present  application  illustrated  several  strengths  and  weaknesses  of  generalizability  theory.  On  the 
positive  side,  it  showed  how  a  multifaceted  approach  to  measurement  error  can  aid  decision-makers  in 
refining an instrument. Specifically, it was found in D study analyses of the full design that variance due to
the ratee-rater and ratee-rater-form interactions was fairly large, even when averaging across three rater
sources.  Because  this  is  a  multivariate  approach,  it  is  known  that  this  error  exists  independent  of  variance 
due  to  forms  or  items  on  forms.  If  the  Air  Force  considers  these  to  be  undesirable  types  of  error  variance, 
these  errors  can  be  treated  either  statistically  by  increasing  the  number  of  rater  sources  (though  this  might 
not  be  practical)  or  methodologically  by  an  intervention  in  the  instrument  development  or  administration 
stage.  For  example,  the  rater  training  process  could  be  altered. 

Another  advantage  is  that  generalizability  theory  yields  indices  of  rater  dependability  in  situations  in  which 
classical  test  theory  approaches  may  be  unable  to  do  so.  For  example,  the  primary  index  of  rater  quality  in 
classical  theory  is  an  assessment  of  interrater  agreement,  but  this  requires  at  least  two  raters.  In  situations 
in  which  only  a  single  rater  exists,  generalizability  theory  can  still  generate  other  measures  of  rater 
dependability in terms of forms, items, occasions, etc.

A very important practical application is that generalizability theory allows decision-makers to assess the
generalizability  of  an  instrument  under  conditions  other  than  the  ones  in  current  use.  In  the  present  study, 
the  only  analysis  of  the  full  design  which  could  be  run  involved  four  items  on  two  forms.  In  all  likelihood,  the 
Air  Force  would  never  actually  collect  ratings  under  these  conditions.  Yet  this  analysis  provided  trustworthy 
estimates  of  variance  components  under  a  variety  of  conditions  involving  larger  numbers  of  forms  or  items. 
This  information  could  be  very  useful  for  decisions  about  altering  the  current  assessment  process.  Ideally, 
information  from  a  generalizability  study  could  be  combined  with  utility  data  to  make  rational  decisions  about 
how  many  raters,  forms,  or  items  should  be  used.  For  example,  simulated  D  study  results  for  the  full  design 
revealed  that  all  three  rating  sources  are  necessary  to  ensure  dependable  ratings.  However,  there  is  not  a 
great  gain  in  generalizability  from  increasing  the  number  of  items.  Utility  analyses  could  be  used  to  select 
the  most  cost-efficient  form,  and  rater  time  demands  could  be  lessened  by  using  only  a  single  form. 
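
One way such a combination of generalizability and cost information might be organized is sketched below in Python. The cost figures (`MINUTES_PER_FORM`, `MINUTES_PER_ITEM`) are entirely hypothetical and do not come from the project; the function names are likewise assumptions made for illustration. The generalizability computation uses the rounded Table 5 G study values with sources treated as fixed.

```python
from itertools import product

# G study components that remain when sources are fixed (Table 5, rounded).
G_FIXED = {"p": 0.100, "pf": 0.022, "p(i:f)": 0.057}

# Hypothetical costs in rater-minutes; these figures are NOT from the report.
MINUTES_PER_FORM = 5.0   # assumed fixed overhead for each form administered
MINUTES_PER_ITEM = 0.5   # assumed time needed to rate a single item


def e_rho2(n_f, n_i):
    """Relative generalizability coefficient for n_f forms of n_i items."""
    relative_error = G_FIXED["pf"] / n_f + G_FIXED["p(i:f)"] / (n_f * n_i)
    return G_FIXED["p"] / (G_FIXED["p"] + relative_error)


def cost(n_f, n_i):
    """Total rater time under the hypothetical cost figures above."""
    return n_f * (MINUTES_PER_FORM + n_i * MINUTES_PER_ITEM)


def cheapest_design(target, max_forms=4, max_items=12):
    """Return (cost, n_f, n_i) for the least costly design reaching the target
    coefficient, or None if no design in the search grid reaches it."""
    candidates = [
        (cost(n_f, n_i), n_f, n_i)
        for n_f, n_i in product(range(1, max_forms + 1), range(1, max_items + 1))
        if e_rho2(n_f, n_i) >= target
    ]
    return min(candidates) if candidates else None


if __name__ == "__main__":
    print(cheapest_design(target=0.80))
```

Under different cost assumptions the preferred design would change, which is the point of pairing D study results with utility data rather than relying on either alone.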

A  final  advantage  comes  from  the  process  of  conducting  a  generalizability  study.  As  mentioned  above, 
generalizability  theory  forces  researchers  and  decision-makers  to  explicitly  address  measurement  issues 
which are too often ignored. For example, what are all the conditions of measurement which could affect
observations  of  individuals?  How  can  these  be  measured  and  controlled?  Are  the  measurement  instruments 
to  be  used  for  relative  or  absolute  decision-making?  Are  particular  measurement  conditions  to  be  considered 
random  samples  of  a  larger  set  of  possible  conditions  or  do  they  exhaust  the  set?  Will  the  same  set  of 
conditions  always  be  used,  or  might  a  smaller  or  larger  set  be  used  in  the  future?  Perhaps  the  most  valuable 
contribution  is  that  generalizability  theory  refocuses  our  attention  on  the  reliability  of  our  measures  and 
reminds  us  that  without  this  basic  property,  our  measures  cannot  be  valid  for  other  purposes. 

One  potential  problem  is  the  accuracy  of  estimated  variance  components.  Indeed,  Shavelson  and  Webb 
(1981)  called  sampling  errors  in  variance  components  the  "Achilles’  heel"  of  generalizability  theory.  In  the 
present  study,  some  confidence  intervals  were  large  enough  to  include  zero  due  to  the  size  of  their  standard 
errors.  Proposed  solutions  to  the  sampling  error  problem  include  large  samples,  large  numbers  of  conditions 
for each facet, and the use of multiple designs (Smith, 1978). For the Air Force, there should not be a problem
in  obtaining  large  samples  for  many  occupational  specialties,  but  increasing  the  number  of  conditions  would 
probably  be  impossible  since  the  instruments  are  already  designed.  Moreover,  it  would  be  infeasible  to  give 
raters  nine  or  ten  20-item  forms  just  to  ensure  more  accurate  estimates  of  variance  components.  The  third 
solution is more workable though, since the Air Force will be collecting identical rating data from additional
occupational  specialties.  The  same  analyses  should  be  repeated  on  each  specialty  and  the  results 
compared.  Then,  variance  components  could  be  averaged  across  specialties  for  more  accurate  estimates 
of  population  parameters. 
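
The dependence of sampling error on sample size can be illustrated with a small simulation. The Python sketch below is not an analysis of the project data: it uses a simplified persons-by-items crossed design rather than the designs analyzed here, the true variance components (.10, .02, .35) are hypothetical, and the function names are assumptions made for illustration. It repeatedly generates data, estimates the person variance component with the usual ANOVA estimator, and reports the mean and standard deviation of the estimates.

```python
import random
import statistics


def estimate_person_variance(n_p, n_i, var_p, var_i, var_res, rng):
    """Simulate one p x i data set (one score per cell) and return the
    ANOVA estimate of the person variance component."""
    person = [rng.gauss(0, var_p ** 0.5) for _ in range(n_p)]
    item = [rng.gauss(0, var_i ** 0.5) for _ in range(n_i)]
    x = [[person[p] + item[i] + rng.gauss(0, var_res ** 0.5)
          for i in range(n_i)] for p in range(n_p)]
    grand = sum(map(sum, x)) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in x]
    i_means = [sum(x[p][i] for p in range(n_p)) / n_p for i in range(n_i)]
    ms_p = n_i * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ss_res = sum((x[p][i] - p_means[p] - i_means[i] + grand) ** 2
                 for p in range(n_p) for i in range(n_i))
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))
    # Expected MS_p = var_res + n_i * var_p, so solve for var_p.
    return (ms_p - ms_res) / n_i


def sampling_spread(n_p, reps=200, seed=0):
    """Mean and standard deviation of the estimate over repeated samples."""
    rng = random.Random(seed)
    estimates = [estimate_person_variance(n_p, 8, 0.10, 0.02, 0.35, rng)
                 for _ in range(reps)]
    return statistics.mean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    for n_p in (50, 200):
        mean, sd = sampling_spread(n_p)
        print(f"n_p = {n_p:>3}: mean estimate = {mean:.3f}, SD = {sd:.3f}")
```

A simulation of this kind, extended to the actual rating designs, is one way to judge how many ratees or specialties would be needed before estimated components can be trusted.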

More  importantly,  the  question  must  be  raised  again  as  to  whether  the  objectives  and  benefits  of 
generalizability  theory  are  consistent  with  the  current  directions  of  the  Air  Force  performance  measurement 
project. According to Kavanagh et al. (1986), the Air Force will assess the quality of rating data primarily in
terms  of  construct  validity  and  accuracy.  This  should  include  results  of  generalizability  theory  which  can  be 
interpreted  as  evidence  of  convergent  validity  or  discriminant  validity.  However,  considerable  other  evidence 
(e.g.,  content  validity,  correlations  with  nonrating  data)  must  be  accumulated  before  fully  informed  decisions 
can  be  made  about  the  construct  validity  of  these  measures.  Further,  without  known  target  scores  included 
in  a  design,  generalizability  analyses  would  have  little  relevancy  to  the  accuracy  of  the  rating  measures.  The 
application  of  generalizability  theory  to  questions  of  accuracy  when  such  scores  are  available  is  an  interesting 
problem  and  is  discussed  below. 

Generalizability  theory  is  first  and  foremost  a  useful  method  for  assessing  the  dependability  of  scores 
over  conditions  of  measurement.  It  is  probably  most  useful  if  used  early  in  the  instrument  development  stage 
when  decision-makers  still  have  some  latitude  in  determining  future  measurement  conditions.  However,  it 
can  also  be  useful  after  the  instrument  is  in  operation,  for  modifying  or  optimizing  existing  procedures.  For 
example,  the  present  study  strongly  demonstrates  the  need  to  retain  all  three  rating  sources  in  order  to 
maximize  the  generalizability  of  the  ratings.  As  discussed  in  the  introduction,  if  an  instrument  cannot  be 
shown to be reliable or dependable over measurement conditions, questions of validity or accuracy are
moot. 




RECOMMENDATIONS 


1. It is important that the performance measurement project continue to collect ratings from all three
rating sources. Levels of generalizability are unacceptably low across sources unless data from all three
rating sources are collected and averaged.

2.  Generalizability  analyses  should  be  applied  to  other  occupational  specialties,  permitting  the 
comparison  of  estimated  variance  components  and  generalizability  coefficients  across  specialties.  If 
specialties  were  a  facet,  the  variance  due  to  specialties  could  be  specified.  If  the  variance  component  for 
specialties were small, variance components for other effects could be combined over specialties for more
accurate  estimates  of  population  parameters.  Alternatively,  Monte  Carlo  studies  could  be  performed  to 
identify  the  total  number  of  designs  (or  subjects  across  designs)  needed  to  compute  accurate  population 
estimates. 

3.  Generalizability  analyses  can  also  be  applied  to  the  Walk-Through  Performance  Test  component  of 
the  Job  Performance  Measurement  Project.  This  would  seem  to  be  a  fruitful  endeavor  since  there  are 
currently  tremendous  manpower  costs  due  to  the  length  of  the  tests.  Generalizability  theory  could  be  used 
to assess changes in the dependability of scores with reductions in the number of administrators, tasks,
dimensions, or tests themselves (i.e., elimination of the interview or hands-on component).

4.  The  application  of  generalizability  theory  to  questions  of  construct  validity  could  be  explored  further 
by  investigating  the  consistency  of  scores  over  different  types  of  performance  measures  (i.e.,  ratings, 
hands-on,  interviews).  This  application  of  generalizability  theory  has  been  recommended  by  Morse  and 
Morse  (1976),  who  also  recommended  using  binary  pass/fail  scores  on  each  criterion.  Evidence  of  construct 
validity would come from the percentage of incumbents who "pass" the Walk-Through and also receive
"passing" proficiency ratings. Alternatively, data from the Walk-Through and performance ratings could be
used to investigate questions of rater accuracy. If the Walk-Through is truly a benchmark, then accurate
raters are those whose ratings most closely approximate the Walk-Through scores. Dickinson (1986) has also shown
how  analysis  of  variance  designs  can  be  used  to  draw  conclusions  about  the  accuracy  of  performance 
ratings.  Similar  conclusions  can  be  reached  through  generalizability  theory.  For  example,  if  supervisory 
ratings  and  Walk-Through  scores  are  compared,  a  small  variance  component  for  sources  could  be 
interpreted  as  what  Cronbach  (1955)  termed  "elevation  accuracy,"  or  the  extent  of  agreement  between  the 
ratings  and  the  target  scores. 

5.  Information  about  the  generalizability  of  measures  should  be  combined  with  other  information  in 
making  decisions  about  the  usefulness  of  various  criterion  measures.  For  example,  the  cost  of  each  measure 
can  be  expressed  as  a  function  of  the  time  and  expense  necessary  to  develop  the  measure  and/or  collect 
data  using  the  measure.  The  benefit  of  any  set  of  measures  can  be  expressed  as  a  function  of  their 
generalizability  levels  and  any  possible  attenuating  effects  on  the  relationship  of  these  measures  to  the  Armed 
Services  Vocational  Aptitude  Battery  (ASVAB).  These  data  can  be  used  to  derive  informed  answers  to 
questions  about  criterion  measures,  such  as:  Which  combination  of  criterion  measures  can  be  collected 
most inexpensively without adversely affecting our ability to validate the ASVAB?